Somewhere in here?

/***** SVAnalyse options *****/

//analyse.vectors            = 3;
//analyse.block.w            = 16;
//analyse.block.h            = 16;
//analyse.block.overlap    = 2;

Yup, block.w controls how wide the blocks are horizontally, and block.h controls how 'tall' the blocks are vertically.
Also, you have to remove the '//' at the start of each line you wish to take effect (they are commented out).
Also, set overlap = 3; it will give you far fewer artifacts.

If you are interested, these are the settings that work really well for me; note, however, that I strongly favour smoothness over artifacts, and you will need a reasonably powerful PC to use them exactly as-is:

/***** SVSuper options *****/

levels.pel                = 3;
levels.gpu                = 0; //SUBJECTIVE: Compare performance and quality with gpu=1 for yourself

levels.scale.up            = 2;
levels.scale.down            = 4;
levels.full                = true;

/***** SVAnalyse options *****/

analyse.vectors            = 3;
analyse.block.w            = 16;
analyse.block.h            = 16;
analyse.block.overlap            = 3;

analyse.main.levels            = 0;
analyse.main.search.type        = 4;
analyse.main.search.distance        = -24;
analyse.main.search.sort        = true;
analyse.main.search.satd        = true; //VERY slow if 'true', but gives slightly higher quality
analyse.main.search.coarse.type    = 4;
analyse.main.search.coarse.distance    = -24;
analyse.main.search.coarse.satd    = true;
analyse.main.search.coarse.trymany    = true;
analyse.main.search.coarse.width    = 1920; //This is for 1080p content. Use 4096 for 4k, however it may crash, run out of memory, etc.
analyse.main.search.coarse.bad.sad    = 190;
analyse.main.search.coarse.bad.range    = -38;

analyse.main.penalty.lambda        = 11.2;
analyse.main.penalty.plevel        = 1.65;
analyse.main.penalty.lsad        = 4000;
analyse.main.penalty.pnew        = 40;
analyse.main.penalty.pglobal        = 50;
analyse.main.penalty.pzero        = 80;
analyse.main.penalty.pnbour        = 30;
analyse.main.penalty.prev        = 20;

analyse.refine[0].thsad        = 170;
analyse.refine[0].search.type        = 4;
analyse.refine[0].search.distance    = -38;
analyse.refine[0].search.satd        = true;
analyse.refine[0].penalty.lambda    = 9.8;
analyse.refine[0].penalty.lsad    = 3900;
analyse.refine[0].penalty.pnew    = 50;

/***** SVSmoothFps options *****/

smooth.rate.num            = 3; //Set this to auto, or to your <TargetFrameRate>/<SourceFrameRate>
smooth.rate.den            = 1;
smooth.algo                = 23;
smooth.block                = false;
smooth.cubic                = 1;
smooth.linear                = true; //Only works with GPU rendering, i.e. use gpu=1

smooth.mask.cover            = 40; //These values are your main trade-off between smoothness and artifacts
smooth.mask.area            = 15; //Making them larger reduces both smoothness and obtrusive artifacts
smooth.mask.area_sharp        = 1.65; // See https://www.svp-team.com/wiki/Plugins:_SVPflow for details

smooth.scene.mode            = 0;
smooth.scene.force13        = false;
smooth.scene.luma            = 1.66;
smooth.scene.blend            = false;
smooth.scene.limits.m1        = 1600;
smooth.scene.limits.m2        = 2800;
smooth.scene.limits.scene        = 3000;
smooth.scene.limits.zero        = 180;
smooth.scene.limits.blocks        = 33; //Adjust this value for scene-change detection. Higher values make it less sensitive
                                                         //and might cause it to try interpolating across scene-changes!

Okay, so what I would recommend is doing the following:

1) COMPLETELY uninstall (manually delete all directories and files that remain) ALL video players, codecs and video-processing filters.

2) To help you with the sound issue (VLC truly is an absolute abomination; I simply cannot fathom why anyone in their right mind would give up quality and control for 'convenience'), download the 32-Bit software (and the applicable prerequisites, if you don't already have them) from here:
https://imouto.my/tutorials/watching-h2 … ture-cuda/
Direct link: https://imouto.my/download/lav-filters-megamix-32-bit/

as well as AC3Filter (Lite) from here: http://www.ac3filter.net/wiki/Download_AC3Filter
Direct link: https://googledrive.com/host/0B792BWceK … b_lite.exe

and finally SVP 3 and the latest SVP-Flow libraries from here: https://www.svp-team.com/wiki/Plugins:_SVPflow
and here: http://www.svp-team.com/files/SVP_3.1.7_Core.exe
(also, if you don't have ffdshow-tryouts installed, you'll need to get that as well: http://sourceforge.net/projects/ffdshow … e/download )

3) Install the software, but DO NOT follow the configuration instructions that they recommend. Just use the default installation settings (except for selecting MPC-HC (lite) and the 'medium' MadVR preset (your GPU might be too slow for even this though, so maybe try 'low' first?)) and configure as below.

4) - Reclock: Use WASAPI Exclusive interfaces (if your sound card can handle it), PCM latency = 50%, Sampling Rate = 96kHz (again, if your sound card can handle it), quality = Best Sinc Interpolation, format = 24 bit. Tick all check-boxes except for "Use AC3 Encoding" and "Disable media speed correction with bitstream audio" (assuming you are not bitstreaming, of course). Under the "video settings" tab, set both the maximum speed-up and slow-down percentages to 3%.
    - AC3Filter: Set it to not change the sample rate, use 24-bit encoding, and output to your speaker configuration (check the audio routing matrix!).
    - MPC-HC (lite): Disable ALL 'Internal Filters' and the 'Audio Switcher' and set the 'Audio Renderer' to use 'Reclock'.
Under the 'External Filters' tab, you should ONLY have the following (all set to 'prefer'): File Source (Async), LAV Splitter, LAV Audio Decoder, LAV Video Decoder, ffdshow raw video filter, AC3Filter, MadVR, Reclock.
Under the 'Output' tab, select MadVR to render 'DirectShow Video' and make sure that you selected Reclock as the 'Audio Renderer'.
    - MadVR: The most important thing here (to first just get things working smoothly) is to go to 'rendering' -> 'general settings', check ALL the check-boxes EXCEPT for 'enable windowed overlay', and set the CPU queue size to 32 and the GPU queue size to 24.
Then go to 'exclusive mode', present at least 12 frames in advance (more may work better or worse depending on your system) and set everything to flush.
Next, go to 'smooth motion' and DISABLE smooth motion.
(Please note: MadVR has a lot of very important settings that have a huge impact on the quality of video rendering, as well as on the GPU resources it demands. I can give some guidance on which settings would work best if you need it, but that is out of the scope of this post...
I would, however, STRONGLY suggest that you get a more powerful GPU or, at the very least, get an extra GPU to run SVP on exclusively (doesn't need to be very powerful, as SVP really doesn't use a lot of GPU resources (never goes above 5% for me, running 2 OC'd Titan X's)).

Finally, install SVP, overwrite the svp-flow libraries with the new ones you downloaded, and edit the override.js file as follows:

/***** SVSuper options *****/

levels.pel                = 3;
levels.gpu                = 0; //SUBJECTIVE: Compare performance and quality with gpu=1 for yourself

levels.scale.up            = 2;
levels.scale.down            = 4;
levels.full                = true;

/***** SVAnalyse options *****/

analyse.vectors            = 3;
analyse.block.w            = 16;
analyse.block.h            = 16;
analyse.block.overlap            = 3;

analyse.main.levels            = 0;
analyse.main.search.type        = 4;
analyse.main.search.distance        = -24;
analyse.main.search.sort        = true;
analyse.main.search.satd        = true; //VERY slow if 'true', but gives slightly higher quality
analyse.main.search.coarse.type    = 4;
analyse.main.search.coarse.distance    = -24;
analyse.main.search.coarse.satd    = true;
analyse.main.search.coarse.trymany    = true;
analyse.main.search.coarse.width    = 1920; //This is for 1080p content. Use 4096 for 4k, however it may crash, run out of memory, etc.
analyse.main.search.coarse.bad.sad    = 190;
analyse.main.search.coarse.bad.range    = -38;

analyse.main.penalty.lambda        = 11.2;
analyse.main.penalty.plevel        = 1.65;
analyse.main.penalty.lsad        = 4000;
analyse.main.penalty.pnew        = 40;
analyse.main.penalty.pglobal        = 50;
analyse.main.penalty.pzero        = 80;
analyse.main.penalty.pnbour        = 30;
analyse.main.penalty.prev        = 20;

analyse.refine[0].thsad        = 170;
analyse.refine[0].search.type        = 4;
analyse.refine[0].search.distance    = -38;
analyse.refine[0].search.satd        = true;
analyse.refine[0].penalty.lambda    = 9.8;
analyse.refine[0].penalty.lsad    = 3900;
analyse.refine[0].penalty.pnew    = 50;

/***** SVSmoothFps options *****/

smooth.rate.num            = 3; //Set this to auto, or to your <TargetFrameRate>/<SourceFrameRate>
smooth.rate.den            = 1;
smooth.algo                = 23;
smooth.block                = false;
smooth.cubic                = 1;
smooth.linear                = true; //Only works with GPU rendering, i.e. use gpu=1

smooth.mask.cover            = 40; //These values are your main trade-off between smoothness and artifacts
smooth.mask.area            = 15; //Making them larger reduces both smoothness and obtrusive artifacts
smooth.mask.area_sharp        = 1.65; // See https://www.svp-team.com/wiki/Plugins:_SVPflow for details

smooth.scene.mode            = 0;
smooth.scene.force13        = false;
smooth.scene.luma            = 1.66;
smooth.scene.blend            = false;
smooth.scene.limits.m1        = 1600;
smooth.scene.limits.m2        = 2800;
smooth.scene.limits.scene        = 3000;
smooth.scene.limits.zero        = 180;
smooth.scene.limits.blocks        = 33; //Adjust this value for scene-change detection. Higher values make it less sensitive
                                                         //and might cause it to try interpolating across scene-changes!

These are the lowest-quality (fastest) settings that I find still give acceptable smoothness, with a tolerable amount of artifacts (on real-life content, not anime). If you still don't get smooth playback, press 'Ctrl + J' whilst playing the video and look for any dropped frames there. Dropped frames indicate that your PC isn't powerful enough for the current configuration and you will have to shave some of the load off.

IceDreamer wrote:

I'm running at 100Hz, and have SVP set to X4 source resulting in ~95. All my sources are 1080p, monitor is running at 3440*1440

Actually, 23.976 x 4 ≈ 96, but still, that mismatch would create a HUGE amount of judder. Rather set your monitor to an integer multiple of 24, and then use Reclock to 'scale up' your playback speed to match.
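To see why the mismatch judders, here is a rough back-of-the-envelope model (a toy function of my own, ignoring vsync details): it counts how many display refreshes each interpolated frame stays on screen.

```python
# Frame-repeat pattern when playing interpolated video on a fixed-rate monitor.
# 23.976 fps x 4 = 95.904 fps; on a 100 Hz display the frame/refresh grids
# drift against each other, so some frames are shown twice -> judder.

def repeat_pattern(fps, hz, n_frames):
    """Number of display refreshes each of the first n_frames is shown for."""
    counts = []
    for i in range(n_frames):
        start = round(i * hz / fps)        # refresh on which frame i appears
        end = round((i + 1) * hz / fps)    # refresh on which frame i+1 appears
        counts.append(end - start)
    return counts

print(23.976 * 4)                          # -> 95.904 (not 95, and not 100)
print(repeat_pattern(95.904, 100, 12))     # mostly 1s with an occasional 2 -> judder
print(repeat_pattern(96.0, 96, 12))        # all 1s -> perfectly even pacing
```

With a 96 Hz monitor (an integer multiple of 24) every frame gets exactly one refresh, which is the whole point of the advice above.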

IceDreamer;
Sadly, there are a few reasons why we won't see real 'high quality' smooth motion in our lifetimes (which I will quickly list below).

1) Temporal Aliasing and The Aperture Problem both place strict mathematical limits on the accuracy of interpolated frames, even with an infinitely fast computer and a mathematically 'perfect' algorithm.

2) Most consumers DO NOT want movies/videos to have smooth motion (look at the backlash against The Hobbit's use of 48 fps, which is not even close to actual 'smooth' motion) and cannot be made to understand that they can simply throw away all the 'extra' frames to make the motion as jittery as they like, while we cannot just make up new frames to get the motion as smooth as we want (so even though we have had the technical feasibility for decades already, the free market simply cannot sustain such media creation).

3) Gains in practical, workstation-level performance have slowed to a crawl since the end of the 2000s. So, while we do have some very high-quality motion interpolation algorithms at our disposal, it will take a very long time before we see the requisite practical (i.e. single-threaded-equivalent) computational power at 50% market penetration.

However, it is still possible to generate an interpolated video stream with an acceptable trade-off between artifacts and smoothness (mainly because 24 fps is so incredibly bad that even a relatively large number of artifacts is less distracting than trying to resolve motion at 24 fps). Unfortunately, SVP uses literally the worst (quality-wise; it is also the fastest/cheapest) of the practical motion interpolation algorithms. In any case, SVP's main goal has always been to bring motion interpolation to the attention of the masses, not to implement the highest-quality motion interpolation algorithms, which would be expensive to develop and too demanding for most people's systems to handle.

That being said, I have found that the following produces, on average, the best PSNR (which correlates very strongly with my own visual perception, but apparently not with that of most other people) on a variety of real-life content (NOT Anime!):

/***** SVSuper options *****/

levels.pel                = 3;
levels.gpu                = 0; //SUBJECTIVE: Compare performance and quality with gpu=1 for yourself

levels.scale.up            = 2;
levels.scale.down            = 4;
levels.full                = true;

/***** SVAnalyse options *****/

analyse.vectors            = 3;
analyse.block.w            = 16;
analyse.block.h            = 16;
analyse.block.overlap            = 3;

analyse.main.levels            = 0;
analyse.main.search.type        = 4;
analyse.main.search.distance        = -24;
analyse.main.search.sort        = true;
analyse.main.search.satd        = true; //VERY slow if 'true', but gives slightly higher quality
analyse.main.search.coarse.type    = 4;
analyse.main.search.coarse.distance    = -24;
analyse.main.search.coarse.satd    = true;
analyse.main.search.coarse.trymany    = true;
analyse.main.search.coarse.width    = 1920; //This is for 1080p content. Use 4096 for 4k, however it may crash, run out of memory, etc.
analyse.main.search.coarse.bad.sad    = 190;
analyse.main.search.coarse.bad.range    = -38;

analyse.main.penalty.lambda        = 11.2;
analyse.main.penalty.plevel        = 1.65;
analyse.main.penalty.lsad        = 4000;
analyse.main.penalty.pnew        = 40;
analyse.main.penalty.pglobal        = 50;
analyse.main.penalty.pzero        = 80;
analyse.main.penalty.pnbour        = 30;
analyse.main.penalty.prev        = 20;

analyse.refine[0].thsad        = 170;
analyse.refine[0].search.type        = 4;
analyse.refine[0].search.distance    = -38;
analyse.refine[0].search.satd        = true;
analyse.refine[0].penalty.lambda    = 9.8;
analyse.refine[0].penalty.lsad    = 3900;
analyse.refine[0].penalty.pnew    = 50;

/***** SVSmoothFps options *****/

smooth.rate.num            = 3; //Set this to auto, or to your <TargetFrameRate>/<SourceFrameRate>
smooth.rate.den            = 1;
smooth.algo                = 23;
smooth.block                = false;
smooth.cubic                = 1;
smooth.linear                = true; //Only works with GPU rendering, i.e. use gpu=1

smooth.mask.cover            = 40; //These values are your main trade-off between smoothness and artifacts
smooth.mask.area            = 15; //Making them larger reduces both smoothness and obtrusive artifacts
smooth.mask.area_sharp        = 1.65; // See https://www.svp-team.com/wiki/Plugins:_SVPflow for details

smooth.scene.mode            = 0;
smooth.scene.force13        = false;
smooth.scene.luma            = 1.66;
smooth.scene.blend            = false;
smooth.scene.limits.m1        = 1600;
smooth.scene.limits.m2        = 2800;
smooth.scene.limits.scene        = 3000;
smooth.scene.limits.zero        = 180;
smooth.scene.limits.blocks        = 33; //Adjust this value for scene-change detection. Higher values make it less sensitive
                                                         //and might cause it to try interpolating across scene-changes!

Unfortunately, your system is not even remotely powerful enough to get an acceptable real-time result; however, you can still try these settings out by running an AviSynth conversion (through VirtualDub) on a test clip (i.e. create an AviSynth script, open said script in VirtualDub, output the RAW AVI, and view it in a video player WITHOUT SVP).
Just use the above settings to create the strings as per this documentation page:
https://www.svp-team.com/wiki/Plugins:_SVPflow
It will take quite a long time, so I would recommend only processing a hundred or so frames (use something like the AviSynth Trim or Select* commands) and letting it run overnight.

VB_SVP wrote:

Async Compute is related to HSA as it enables the GPU to be used to perform computation in lieu of the CPU without tanking the GPU's performance (and it is functionality that Nvidia GPUs don't have). With the iGPUs, so many of the Intel and AMD ones are rather useless for GPU computation as they lack the functionality outright or "pull an Nvidia" and emulate it, which lacks the performance.  Tragically, AMD took too long to put GCN on its line of APUs so all pre-2014 APUs (Kaveri) are stuck with non-async compute, non-HSA enabled GPUs based on pre-GCN architecture, even though GCN was available on their dGPUs in 2011/2012.  Compounding that brutality, even though they are quite capable of working in some capacity with Vulkan/DX12, AMD has no plans to support those APIs on pre-GCN GPUs.

Firstly, I would just like to clear up any confusion regarding Nvidia's support for asynchronous computation, see here: Nvidia supports Async Compute

I still think this whole iGPU/APU thing is a very bad idea (as long as people keep demanding they be low power as well). Think about it, a CPU has to process all the sequential program instructions (even modern software like new game engines, that completely utilize 4-cores, do so with four threads that do a lot of sequential computations), and you want it to be very good at that sort of thing to be able to keep up with modern GPUs (especially since transistor geometry scaling still brings huge performance gains to GPUs).
GPUs, on the other hand, aren't very good at dealing with things like logic expressions, branching ('if' , 'else') and nested 'for' loops. To use your example of SVP with duplicate frame detection by using the iGPU, the basic program flow of the most computationally intensive subroutine would look something like this:

>Get the current frame, the next frame, and their representations as a plane of blocks.

>Now get the 'quality' of the current-to-next and the next-to-current motion vectors by running a convolution with each one of the current frame's blocks over the regions of the next frame that correspond to the center location of the current block being tested + the requested motion-vector range to be searched, in every direction. Do the same with the next frame's blocks convolving over the current frame to get the 'reverse' motion-vectors.

>Select the 'best' motion-vector for each block, for both the next-to-current and the current-to-next frames.

>Interpolate a frame somewhere between the current and next frames by 'moving' one of the following blocks by an amount and direction dictated by a function of how good the motion-vector for the next-to-current and the current-to-next frames are:
#If there is occluded motion, use only the block from frame with the unoccluded pixels;
#If one block's motion-vector is much 'better' than the other frame's corresponding block's motion-vector, only use the 'good' one;
#If both motion-vectors are of a similar quality, compose the new frame of a weighted combination of both, where the weight takes into account how 'good' each motion-vector is, as well as how close (temporally) the frame that the block is being taken from is to the interpolated frame (both frames will be equally close for simple frame doubling).
(In reality, you should loop through each pixel of the 'new' frame that sits between the current and the next frames, and compute the interpolated value as a function of all the motion vectors that can reach that pixel from both frames. If you only compare the two corresponding blocks from each frame and then move the best one to the position in the interpolated frame dictated by its motion-vector, then you may get regions of pixels without any value, i.e. 'holes' in your interpolated frame.)
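For what it's worth, the flow above can be sketched in a few lines of Python/NumPy (a deliberately crude toy: fixed block size, exhaustive SAD search, one-sided block-paste interpolation, and none of SVP's masking or occlusion handling; all names are mine):

```python
import numpy as np

BLOCK, RANGE = 8, 4  # toy block size and search radius

def best_vector(cur, nxt, by, bx):
    """Exhaustive SAD search: find the (dy, dx) that best matches one block."""
    block = cur[by:by+BLOCK, bx:bx+BLOCK].astype(np.int32)
    best, best_v = None, (0, 0)
    for dy in range(-RANGE, RANGE + 1):
        for dx in range(-RANGE, RANGE + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and y + BLOCK <= nxt.shape[0] and 0 <= x and x + BLOCK <= nxt.shape[1]:
                sad = np.abs(block - nxt[y:y+BLOCK, x:x+BLOCK].astype(np.int32)).sum()
                if best is None or sad < best:
                    best, best_v = sad, (dy, dx)
    return best_v

def interpolate_midpoint(cur, nxt):
    """Build a halfway frame by moving each block half-way along its vector."""
    out = ((cur.astype(np.uint16) + nxt) // 2).astype(np.uint8)  # fallback: blend
    for by in range(0, cur.shape[0] - BLOCK + 1, BLOCK):
        for bx in range(0, cur.shape[1] - BLOCK + 1, BLOCK):
            dy, dx = best_vector(cur, nxt, by, bx)
            ty, tx = by + dy // 2, bx + dx // 2
            if 0 <= ty and ty + BLOCK <= out.shape[0] and 0 <= tx and tx + BLOCK <= out.shape[1]:
                out[ty:ty+BLOCK, tx:tx+BLOCK] = cur[by:by+BLOCK, bx:bx+BLOCK]
    return out

# A bright block moving 4 px to the right between two frames:
f0 = np.zeros((32, 32), np.uint8); f0[8:16, 8:16] = 255
f1 = np.zeros((32, 32), np.uint8); f1[8:16, 12:20] = 255
mid = interpolate_midpoint(f0, f1)
print(best_vector(f0, f1, 8, 8))   # -> (0, 4)
```

Because it pastes blocks one-sidedly, this toy exhibits exactly the 'holes'/overwrite problem described in the parenthetical above, which is why real interpolators work per output pixel.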

Now, you would like to check for a duplicate frame, which would have a large majority of the motion-vectors simply being (0,0). That is only possible near the end of the algorithm (once you know the 'best' motion-vectors for each direction), at which point you would choose to 'discard' the next frame and redo the calculation with the current frame and the frame after the 'discarded' one (unless the number of repeated frames remains constant throughout the video, in which case you can simply use something like one or more AviSynth calls to SelectEven() or SelectOdd() before the video gets to SVP's interpolation functions). This loop continues until you find a next frame with some useful motion between it and the current frame. If you then interpolate the number of discarded frames multiplied by your interpolation factor, 'jump over' those 'discarded' frames, and continue your interpolation from the 'next' frame you used for the calculation, you end up with the same number of frames as if you had interpolated across all those duplicate frames. The computational cost would also come to about the same, but it would require some extra logical comparisons to check whether or not you need to drop a frame. Those checks also have to be done for all frames that actually do contain motion, further decreasing performance. Now, this wouldn't be too bad on a CPU, but here is where the problem with APU processing comes in:
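The 'discard and jump over' bookkeeping can be sketched with scalars standing in for frames (my own toy: equality stands in for 'nearly all vectors are (0,0)', and linear blending stands in for motion-compensated interpolation):

```python
def interpolate_with_dup_skip(frames, factor):
    """Interpolate a stream that may contain duplicate frames, keeping the
    same output frame count a plain interpolator would produce.
    Scalars stand in for frames; '==' stands in for duplicate detection;
    linear blending stands in for motion-compensated interpolation."""
    out = []
    i = 0
    while i < len(frames) - 1:
        j = i + 1
        while j < len(frames) and frames[j] == frames[i]:
            j += 1                      # 'discard' duplicates of frame i
        if j == len(frames):            # stream ends in duplicates: just repeat
            out.extend([frames[i]] * ((len(frames) - 1 - i) * factor + 1))
            return out
        span = (j - i) * factor         # frames owed for the skipped stretch
        for k in range(span):
            t = k / span
            out.append(frames[i] * (1 - t) + frames[j] * t)
        i = j                           # 'jump over' the discarded frames
    out.append(frames[-1])
    return out

# [0, 0, 2]: the duplicate 0 is skipped, yet the motion is spread evenly over
# the whole stretch and the expected output frame count is preserved:
print(interpolate_with_dup_skip([0, 0, 2], 2))   # -> [0.0, 0.5, 1.0, 1.5, 2]
```

Note that smoothing motion *across* the duplicate (0.5, 1.0, 1.5) is exactly the payoff over naively interpolating each source interval, which would emit a flat 0, 0, 0 and then a sudden jump.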

GPUs are very good at doing things like massive matrix calculations, image convolutions, interpolated-frame composition, etc., but slow down immensely if you don't frame the problem as a matrix computation. Basically, let's say you compute the absolute difference between two frames, as an unsigned value at least 16 bits in size, pixel by pixel, via two methods.

First method: You do everything on the GPU via two loops (the outer one for, say, the columns and the nested one for the rows) and inside the inner (nested) loop, with which you then calculate:
a = pixel1 - pixel2; if (a < 0) then return -a, else return a;

Second method: You only do the following on the GPU: a1 = frame1-frame2; a2 = frame2-frame1; (both done as one massive SIMD command).
Thereafter, on the CPU, you loop through every pixel in a1 via two 'for' loops (just like in the above method for the GPU) and for each pixel you do the following:
if(pixel_a1<0) then pixel_a1=pixel_a2;

If done on a discrete, separate CPU and GPU, the speedup of method 2 over method 1 can be anywhere from 100% to 1000%. Newer GPU compute architectures (and especially NVIDIA's CUDA compiler) are getting better and better at reducing this difference, but the core principle remains the same.
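A CPU-side NumPy analogue of the two methods (a sketch, not actual GPU code; np.maximum plays the role of the branch-free select) shows they compute the same thing:

```python
import numpy as np

rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, (64, 64)).astype(np.int32)
frame2 = rng.integers(0, 256, (64, 64)).astype(np.int32)

# Method 1: per-pixel loop with a branch -- poison for a GPU's SIMD lanes.
def absdiff_branchy(f1, f2):
    out = np.empty_like(f1)
    for y in range(f1.shape[0]):
        for x in range(f1.shape[1]):
            a = f1[y, x] - f2[y, x]
            out[y, x] = -a if a < 0 else a
    return out

# Method 2: two whole-frame subtractions (one SIMD-friendly pass each),
# then a branch-free element-wise select of the non-negative result.
def absdiff_vectorized(f1, f2):
    a1 = f1 - f2
    a2 = f2 - f1
    return np.maximum(a1, a2)

# Both produce identical absolute differences:
assert (absdiff_branchy(frame1, frame2) == absdiff_vectorized(frame1, frame2)).all()
```

Even on a CPU running NumPy, the whole-frame version is dramatically faster than the Python per-pixel loop, which mirrors the branchy-kernel-vs-SIMD gap described above.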

If, however, you have an iGPU, then porting some calculations over to it from the CPU may or may not bring a performance advantage. Since you are stuck with OpenCL for iGPU programming, you either need to have basically been on the design team for the GPU you are programming (to be able to transform problematic code into the most optimal form for the iGPU you are targeting), or you need to avoid such code entirely (as method 2 does).
{Yes, I know, a lot of people try desperately to justify their purchases by saying things like "OpenCL is just as good as CUDA for X, but supports everything, so it's better". Well, technically yes, you can do almost everything in OpenCL that you can do in CUDA; the problem is the difficulty of doing so. GPU programming is DIFFICULT, very difficult, which is the main reason why we don't see more apps utilizing the enormous computational potential inside GPUs.
Trying to get the same performance from the same NVIDIA GPU (or an AMD one with equivalent theoretical computational throughput) with OpenCL as you got from a CUDA program makes a difficult task so much harder that almost no one manages the same ultimate performance in a real-world application.}

Finally, even with optimal code, an iGPU has to split its very limited power with the CPU. Compared to the same CPU and GPU as discrete products, the maximum performance of even an optimally coded 50-50 algorithm (one that needs an exactly equal amount of CPU and GPU power) will take anywhere from 50% to even 100% longer on an integrated APU comprised of exactly the same processors. However, if you are against even mildly overclocking your setup (for some strange reason some people, who are not even electrical engineers themselves, seem to think that a moderate overclock reduces the lifespan of your components by an appreciable amount), then the difference will of course be much smaller. You can even go further and underclock the discrete setup to achieve the same performance as the integrated setup, but that only serves to further show how power limited APUs are.

It's also not just limited power that constrains performance. Having to share the CPU's memory means the iGPU has to live with a minuscule portion of a similar dGPU's memory bandwidth. On-chip, Crystal Well-style L4 caches can help in this regard, but they are so tiny in comparison to discrete GPU memory that doing any sort of video processing through only that cache is completely impossible.

APUs are a nice idea, don't get me wrong. It's easy to think of an iGPU as something like an enhanced instruction set, such as AVX or FMA, that can massively improve a CPU's performance in certain tasks. But in practice, having the CPU keep the iGPU on constant life support (with woeful GPU memory bandwidth, and power-gating killing entire blocks every few milliseconds) seems to negate all of the advantages of having it so close to the CPU. If, however, the choice is between two otherwise similar processors, where one has an iGPU and the other doesn't, the iGPU is at least a free bonus.
Also, I really don't get this reluctance to develop for a 'closed' architecture. No matter what percentage of the market a certain GPU has, why not develop your application for the one that can actually deliver the required performance? If people really want to use a certain application, then they will buy the required hardware. This is how it works in industry, and also in the consumer market for most non-software things (say you're following a cooking recipe which calls for baking a cake in an oven, but you don't have an oven. Is it really necessary to rewrite the recipe to obtain at least something resembling a cake by using another, more readily available, piece of hardware? A hot stone, perhaps?).

I just think that this pathological fear of 'vendor lock-in' is unnecessarily limiting the applications and features in applications that are available to us. I mean, how many programs are there that can, for instance, do what you want (i.e. interpolate a video stream containing many duplicate frames of varying lengths)?

Thinking about this in another way paints iGPUs in a much better light. What about systems that have a discrete GPU, but where the CPU just also happens to have an iGPU? With the massive transistor budgets available these days, simply integrating an iGPU onto the CPU die shouldn't sacrifice much potential performance (if any).
Now we have a scenario where people have capable CPUs and discrete GPUs, but where coding in OpenCL allows developers to access a second, integrated iGPU when and where it would be most beneficial to do so.
Through OpenCL, an AMD iGPU and an NVIDIA dGPU can even work simultaneously on the exact same problem, with each GPU doing the work most suited to its compute architecture!
The CPU's available power budget can also, theoretically, be utilized much more completely and effectively by interleaving CPU and GPU commands (such as doing some GPU work while the CPU waits for data to come in from the dGPU, for example). If the iGPU's wake up and sleep sequences are fast enough, it can even be used in place of SIMD instructions like SSE, AVX and FMA. The potential benefits in this scenario, are simply unimaginable.

Maybe I'm just a bit of a pessimist, but it doesn't seem like we will be seeing much use of this second usage scenario of HSA. I just don't get why though. Maybe the lifetime of a GPU compute architecture is just too short for developers to spend the time on optimizing an OpenCL workflow for that specific processor?
Or maybe most people still aren't even aware of the potential gains of simultaneous computation with a dGPU and an iGPU?

VB_SVP wrote:

I'd love to see FMA and GPU SAD (especially as that could be used to efficiently detect and delete duplicate frames, further smoothing out video) come into use.

Me too. Even though I have 'invested' in AVX2, I'd much rather leave my CPU 'under-utilized' if that means we can get GPU-based pixel-metric calculation. Heck, without having to do all the pixel-metric calculations, the CPU would be much less loaded than it is now, so AVX2 would be less necessary as a result. GPU-based calculations would also open the door to much better image quality (think SSD/SATD instead of SAD).

Actually, FMA would also help to compute such pixel metrics much more efficiently (although not nearly on the same level as GPU acceleration would). The only problem is that the performance required (even with FMA) would probably put it out of practical reach for all AMD owners anyway. Maybe Zen will make it more practical?
FMA also doesn't really help much with standard SAD calculation, and it would require someone to write the entire assembly functions from scratch (I don't think the x264 codebase includes any FMA pixel metrics), but it would be of great benefit to things such as bicubic frame resizing, gamma correction and colorspace conversion, if and when SVP makes use of such features (actually, SVP already uses frame resizing, but that could perhaps be implemented more efficiently in the GPU-accelerated svpflow2.dll part of SVP).

VB_SVP wrote:

As an AMD user, which got with the FMA program years before Intel did

That really is extremely sad. Even though I have only once owned an AMD CPU (the original FX series), I really hoped that the Bulldozer architecture would encourage developers to change their coding styles towards relying more on specific per-processor optimizations and less on outdated, general instruction sets. Since the 'thin-and-light' craze has capped CPU performance, the only big performance advances have been, and will continue to be, made by developing increasingly specialized and exclusive instruction sets that are optimized for specific classes of problems.
If FMA had caught on, Bulldozer's performance would have exceeded Intel's in all video processing (and gaming post-processing) applications, among others, which would have forced a new 'instruction set arms race' and continued linear performance scaling together with Moore's law.

Nintendo Maniac 64 wrote:

Just to clarify about SSE2 vs AVX and stuff - I'm guessing none of the newer SSE instructions (save for AVX) would be useful?

Ugh, it seems my reply never actually posted!
Mostly yes, that is correct. Even AVX itself won't help much; it's actually the second revision, AVX2, that contains the required instructions.

SSE3->4.2 can also provide a bit of a speedup, but it would be the same amount of work that implementing AVX2 would take for about 20% (mabe even less) of the gain.
The FMA instructions, however, do have the potential for some very nice gains in code of the form x = (a+b)*(c+d), which is very common in any video or audio processing filter (indeed, video upscaling and downscaling, as well as audio upsampling, down sampling and equalizing, almost entirely consist of loops of millions of x = a + b*c or x+=b*c).
But for SVP, it seems that AVX2 would bring the most benefit (though actually processing the SAD calculations as entire frames and kernel blocks, on the GPU, could speed up SVP's main bottleneck by at least an order of magnitude over and above what AVX2 could do. It's just also about 10x harder to actually implement hmm ).


Chainik wrote:

I prefer to think that x264 guys are extremely experienced with all that stuff

Yes, of course. I was just wondering what those x264 calculations would bring to an integer codepath. It seemed strange for SVP, that's all... Then again, SVP itself comes from the very 'strange' MVTools, so there's that. smile

Nintendo Maniac 64
Yes, it's just that a lot of people have a previous-gen GPU still lying around that they can use for SVP, because activating the IGP, even if it doesn't do much work, still takes up quite a bit of power (going from 0V, power-gated, to 1.xV and 1000+MHz). And since almost all modern chips are power limited (unlocked chips will need some more cooling and may need to drop a few hundred MHz, if overclocked), it would be in badhomaks' best interest (to be able to run those settings I recommended) to run the highest overclock he possibly can.

Of course, backing off on the block overlapping, for instance, would make running both the CPU and IGP, at the same time, much more feasible. Then again, maybe he already has a 5GHz 6700k with some overclocked IGP in there to boot. smile

badhomaks wrote:

Hey, you seem really well versed with this, what can you recommend to get the smoothest video? Cause for most videos I max everything out and have some cpu power to spare.

Thank you for your kind words. Nintendo Maniac 64 is right: if there are a lot of thin lines (or, rather, many high-amplitude and high-frequency spatial components), then the 'Complicated' shader will 'smooth' them over if it cannot find a smooth progression from one frame to the next. Please see here for the general difference between 'smooth' and 'sharp' SVP settings when it comes to difficult moving objects that are not very thin (the thin lines get blurred into the wavy image with the 'smooth' settings).
What I would recommend depends a lot on your system specs, but this is what I'd recommend for the smoothest real-time interpolation of standard high-quality (>5Mbps x264 encoded @ High10 & slower preset or more) 1080p23.976 series to 1080p60:

Use the same GUI settings that you posted, except for setting the 'SVP Shader' to "23. Complicated" and setting 'Artifacts masking' to "Weakest".
Then make use of the following values for override.js (a settings file in the main SVP install directory that can override the normally hidden settings that the developers deemed should not be altered, since these can do much more harm than good). Also, if you have the time, you may want to look over the documentation of these Advanced SVPFlow Options and the MVTools2 parameters, to better understand what the overrides do.

Remember to first create a backup of your original override.js file (so that you can easily restore your SVP configuration to the way that it was). Then try out my override.js recommendations by simply overwriting the default file in the SVP directory with the one that I attached here.

EDIT: I would also recommend running MadVR on a second discrete GPU (if possible) and to use a sharp bicubic scaler with the anti-ringing filter, as well as making use of the available sharpening post-processing options, to sharpen up the blurry SVP output (remember to check that this doesn't expose a lot more artifacts and doesn't make existing artifacts much more visible).

In another (perhaps even more impressive) benchmark, the 6700k was MUCH sharper than anything that came before, however Intel still needs some pointers on 'cutting the cheese'. lol

Nintendo Maniac 64 wrote:

Also I'm not sure why you're bringing up old CPU architectures like the Pentium 4 in regards to my mention of processor models that are only a year old or so like the Pentium G3258...

Because it's the first in the list of CPUs that SVP is designed for (aka SSE2). Also, I'm sorry, that rant wasn't aimed at you at all. I realize that you were only asking about the impact of building AVX2 libraries on 'lower end' CPUs. Luckily, after looking through the code a bit, it seems like SVP is already written to be able to use AVX2 instructions without interfering with the standard SSE2 codepath (at least for the computationally intensive pixel metric calculations). It shouldn't be too difficult to follow the same template for other functions. At worst, the installer can detect your CPU and install the correct library, if it were to be compiled for different architectures.

Chainik wrote:

parallelization is a cheating  
we're already in heavy multi-threaded environment and we're not interested in single-threaded performance

Yes, I understand that, but Avisynth's multithreading environment seems kind of funky to me. I have run some scaling tests on a 4-core 3770k with the default libs, like the one you posted a while back with the AMD CPU, and have found the optimum number of Avisynth 'threads' to be 22. Setting threads=8 leads to more than 50% less performance (I'll post my scaling results for Avisynth 2.5.8 and 2.6.0 shortly).

Actually, after reviewing the code a bit, it seems like all the framework is already in place for proper AVX2 support on the assembly level.
Do you know of any work that was, or is being, done to enable AVX2 SAD calculation?
One more question: it seems your code also has 'support' for plain AVX instructions in calculating pixel metrics. Do you have any idea why someone would have put floating-point AVX together with the other integer SSE2 optimizations?

sparktank wrote:

I seem to remember reading when I wanted to get into programming, that main thing I wanted to learn was to update some things for AVX extension, but then read a lot of things (I mostly don't remember) that said it's not really worth it.
Short/long math is where it really counts?
And in most cases for AVS users, it doesn't count for us so much.

Yes, plain AVX instructions only work on floating point data (32/64-bit fractional numbers), while video data is stored as 8-bit unsigned integers (256 numbers from 0 to 255). That means AVX only benefits workflows that require the precision of floating point numbers (which are much slower to work with than integers).
AVX2, on the other hand, does work on 'packed' 8-bit data, so would provide a very nice speedup over 'legacy' SSE~SSE4.2 instructions.

libiomp5md.dll is nowhere on my system.

it seems there are no more static libraries available for distribution regarding OpenMP.

Ah, I don't know why it did that (I didn't enable the OpenMP language in the project settings, and SVP doesn't contain any OpenMP instructions whatsoever), but thank you very much for reporting back.

I'll try to get the compiler to not link against OpenMP and test the resulting libraries by uninstalling all dev kits on the old 3770k-based system and running it there. If I get it to behave, I'll certainly post the new libraries for you to test as well.

Chainik
Here are my current C++ compiler options (as taken from the VS2013 project property page, under C++, Command Line):

All Options:
/MP /GS- /W3 /QxCORE-AVX2 /Gy /Zc:wchar_t /I"C:\Users\Xenophos\svpflow\src\jsoncpp\include" /I"C:\Users\Xenophos\svpflow\src\jsoncpp\Release" /Zi /O3 /Ob2 /Fd"Release\vc120.pdb" /Quse-intel-optimized-headers /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "SVPFLOW1_EXPORTS" /D "_WINDLL" /D "_MBCS" /Qipo /Zc:forScope /arch:CORE-AVX2 /Gd /Oy /Oi /MT /Fa"Release\" /EHsc /nologo /Qparallel /Fo"Release\" /Qprof-dir "Release\" /Ot /Fp"Release\svpflow1.pch"

Additional Options:
/Qipo:128 /Qunroll:256 /fast /QxCORE-AVX-I /Qip /Qopt-args-in-regs:all /Qopt-class-analysis /Qopt-dynamic-align /Qopt-jump-tables:large /Qopt-mem-layout-trans:3 /Qopt-prefetch:4 /Qsimd /Qunroll-aggressive /Qinline-calloc /Qinline-forceinline /Qinline-max-per-compile- /Qinline-max-per-routine- /Qinline-max-size- /Qinline-max-total-size- /Qinline-min-size=16384 /Qinline-dllimport

I'm not at all certain that these options are the best (or even a particularly good) set for a typical SVP workload, but I do think they are a decent starting point for benchmarking the effect that the compiler has on SVP's performance.

Chainik wrote:

xenonite
I believe we've already optimized that code as much as it's possible

Yeah, that's what I thought too... can't hurt to try though!  big_smile
I don't see any vectorizing or parallelizing hints to the compiler in the source code, so I might try to better guide its compilation before I start trying to dig through the x264 assembly source.

One more question: Have you ever run a performance profiler on the complete SVP program and, if so, approximately what fraction of the total time taken per frame was spent within Search.cpp?

EDIT:

Nintendo Maniac 64 wrote:

All AMD CPUs and APUs since Piledriver support AVX

This is true, however, the Intel compiler has been known to artificially gimp performance on AMD processors. It would be a far better idea to compile with GCC for AMD processors.
Please see here: http://www.agner.org/optimize/blog/read.php?i=49 for a description of the problem,
and here: https://software.intel.com/en-us/articl … ice#opt-en for the official Intel comment on this issue.

Nintendo Maniac 64 wrote:

more importantly is that Intel Celerons and Pentiums do not support AVX

Yes, but I don't see the problem? SVP is, by default, built for compatibility with CPUs going back at least 5 years, all the way to the first Pentium 4 processors of 14 years ago (i.e. anything with SSE2). Sure, these DLLs aren't usable on those platforms, but why would you even want to use them there?

I completely understand supporting legacy architectures, especially in the age of "good enough {insert anything here}" where most people don't regularly upgrade their hardware anymore.
The problem I do have is: what about the people who cannot achieve sufficient performance with their current hardware? In my case, I completely stopped watching series and movies for a period of 3 years (I can't just ignore or 'watch around' the things that bother me) while I worked as hard as I possibly could to make enough money to buy the absolute 'fastest' components currently available.
Then imagine my horror when I realized that all my hard work was basically for naught. I still could not achieve a 'good' level of quality and not just because "CPU performance has completely stagnated". No, the high-end Haswell (and Skylake) CPUs have the potential to double their throughput of 8-bit integer operations (the kind that is used to find motion vectors for 8-bit-per-component video data) compared to high-end Sandy- and Ivy-Bridge processors.
This basically told me: "No matter how hard you work, you will never be able to buy the level of performance you desire, since the maximum performance is limited by those who don't really care about achieving the maximum possible image quality".

I'm sorry for going off on a rant there but, as you probably gathered, this is quite important to me. It's not simply a matter of some 1%'er bragging about the amount of money he can spend on some 'completely over-the-top PC', to the detriment of those who simply cannot afford the same things due to factors completely out of their control. This is about not being able to achieve anything, no matter how much money you are eventually able to make.
And it's not specific to SVP at all: how many applications can you list that make use of the newest advanced processor features? I'm not even talking about 'difficult' things such as parallel programming or GPGPU; just simply caring enough to recompile your code or (if you have code that is sufficiently performance-sensitive to have needed hand-tuned assembly optimization) to add some functions that make use of these new instructions.
[Ps. Yes, technically you can (and good software engineering policy dictates that you should) support all the CPUs from both manufacturers to their fullest performance potential without penalizing any of them by doing so. It just makes writing a simple program so much more difficult that it becomes quite prohibitive for a small group of volunteers, working mostly in their off-time, to achieve their goals in a reasonable time frame. In those cases, it's simply a matter of being able to get the app out the door in the first place, which, I believe, is a much better excuse than simply optimizing for the lowest common denominator and not caring about the rest.]

So, in a nutshell, this is why I decided to start properly learning C++ programming.
Do I think that I will be able to make a notable difference, even if I successfully master modern multi-threaded programming? No, of course not; I'm not THAT stupid... but that won't stop me from at least trying! tongue
Also, Chainik, you guys absolutely rock. This really doesn't get said enough, but cleaning up the MVTools code-base to the point that it is at today was some truly amazing work.

mashingan wrote:

Xenonite

I got "Platform returned 126" error. What is that?

Is there any way to test it? I simply did it by copying it to plugin folder in SVP folder though.

I apologise for responding so late... I forgot to mention that these libraries are for SVP 3.1.7 (although they should also work for SVP4).

I don't know why you got that error... It could be that you are using Windows XP? Or a 32-bit build of Windows? Or maybe an AMD CPU, or a CPU older than Sandy Bridge. Other than that, I can only guess that some configuration of your OS is causing problems (maybe some non-standard SVP installation that wasn't properly uninstalled?).

Either way, I did test that they actually work on my development laptop (a 17" Sager Haswell desktop-replacement system), on an old Ivy Bridge-based 3770k system with SLI 780 Ti GPUs that I have lying around and, of course, also on my 5960X-based HTPC, so they should just work without any other configuration.
However, I don't have any OS other than Windows 7 (I have no intention of "upgrading" to Windows 10, except maybe for a dedicated DirectX 12 gaming box somewhere down the road), so that might be the cause of some of your issues?

Chainik
Wow, I was completely unaware of that. It does, however, explain why my MSVC builds are consistently slower than your default builds (I have been poring over my development environment, checking and rechecking every single setting to find why I can't reproduce it, just because I assumed you had been using MSVC all along).

Anyway, thank you very much for providing such optimized builds, I can't tell you how much it means to me that the developers of an application actually took the time to compile an optimised binary (virtually all video processing apps/utilities and plugins are just compiled with msvc or gcc and let go; sometimes even without any /Ox compiler flags or /arch designations).

I'll post my compiler settings when I get home (posting from my phone atm) so that you (if you have time) can tell me what you think, and I'll also try to benchmark some important permutations to see if I can find a way to significantly improve on the default distribution (a sustained average >10% performance improvement).

I'm still very new to general applications programming (I mainly work on the hardware side of projects, with some minimal assembly hand-tuning of a particularly problematic ISR here and there), but I will also take a look to see if there are any easy ways to optimise some of the most expensive functions and then post the code (if any; I'm probably being way too optimistic) for review. I can't run any complete performance profiling tests since I only have the svpflow1 code to work with, but I believe most of the work is done in PlaneOfBlocks, yes?

Hi guys, I've made a few optimized compiles of the latest Open-Source svpflow1.dll library with the 2015 version of the Intel Compiler on Visual Studio 2013.
I have not altered the source code, only set the compiler options and compiled it for the different AVX architectures and to maximize runtime speed.

Since almost all of the computationally intensive code has been hardcoded to SSE assembly, don't expect these builds to dramatically improve SVP's performance. However, they should be the most optimized that you can get for CPUs younger than 5 years old without actually changing the source code.

NB. Mods: If these builds are not useful / break any forum rules, please don't hesitate to delete this thread. I only post them in case there are some users who would like an (at least partially) ICL-compiled SVP.

EDIT: The files have been removed until I can figure out why OpenMP keeps imposing its inclusion. Also, I want to do some further tests with the compiler optimizations, since aggressive loop unrolling seems to hurt, rather than help, SVP's performance (even though the .dll's file size only grew to about 1100kB).

Nintendo Maniac 64 wrote:

16px Average 0 also helps out decently as well (good balance between smoothness and performance)

Cool. I believe the computational cost grows quadratically as the block size shrinks, i.e. O(N^2) in the number of blocks per dimension, while the quality impact is extremely subjective (very small blocks can actually degrade the image) and probably scales only linearly with block size.
Now you have given me ideas; I am busy disabling most of my cores through the bios and underclocking to 1.8GHz to see what the best image is that I can produce under those circumstances.  lol

Nintendo Maniac 64 wrote:

TDP does not equal performance

Yes, in general you are correct.

What I was trying to explain was an easy metric that people (who do not know about the relationship between architecture, IPC, transistor dimensional scaling, core count and clock frequency) could use to quickly determine whether their PC has the potential to run SVP 3.1.7 at a higher quality than SVP 4.
With all "relatively recent" (i.e. released with or after the Sandy Bridge generation) CPUs being gimped by consumer opposition to powerful processors, all of the 'high TDP' processors perform relatively similarly (compared to the generational improvements before), especially in SVP (a 5-10% difference would not reliably make a higher quality level attainable).
With that in mind, neither clock speed, nor model number, nor core or cache count, nor any other single parameter that is listed on the physical packaging or on the official online Intel "Tech specs" listing correlates as well with the CPU's potential SVP performance as TDP does.

Nintendo Maniac 64 wrote:

[...] easy example is Broadwell which is only 65w TDP yet has slightly better IPC than Haswell and has the fastest iGPU Intel has ever made.

True, but if someone were to try to use the IGP while running SVP on the CPU (perhaps even using the IGP for SVP), then there might not be enough TDP budget left for a heavy SVP 3.1.7 configuration.

Also, I never claimed a formula like: if CPU A's TDP > CPU B's TDP >= 80W, then CPU A's performance = CPU B's performance * f(CPU A's TDP / CPU B's TDP), where f is a linear or sub-linear function.
I understand that AMD doesn't have access to Intel's advanced process nodes, so their TDPs will be significantly higher: the manufacturing node is optimised for low-TDP parts (<50W) while their CPUs are designed for high clock speeds, which meant they had to use a significantly higher voltage than the process was originally intended for (and power scales with the square of voltage). TDP figures are also not comparable between manufacturers. So, yes, TDP is not a perfect performance measure by a long shot, but what other single 'spec sheet' number is? Maybe price, lol.

Anyway, I simply thought that someone who doesn't know whether to choose SVP 3.1.7 or SVP 4 could quickly look up the TDP of their processor, since it is the single most important parameter influencing modern consumer purchases (allegedly). If their PC is less than about 4.5 years old and they have a CPU TDP > 80W, then they should be able to use higher quality settings in SVP 3.1.7. And that was all that I meant by it.

Nintendo Maniac 64 wrote:

Thread I made regarding the subject:
http://www.svp-team.com/forum/viewtopic.php?id=2699

Thank you for pointing me to that thread; I never would have believed that a Core 2 Duo can run SVP at all, let alone as well as you described.
Very interesting indeed.

Ps. It seems that Chainik has indicated that, once SVP4 is fully released, you will indeed be able to fully configure all of its settings, just like you can with the current SVP 3.1.7.
So then the question becomes moot, SVP 4 will become better than 3.1.7 in all aspects; a.k.a. progress without regression! big_smile


When you have such a low resolution that spatial aliasing begins to mask (or rather, destroy) temporal information, SVP will have a very tough time producing a proper output.

As a first possible fix, you could try creating a low-res profile with all the GUI values jacked up to the right, and change the values in override.js by backing up the default file and overwriting the original with the one I attached (remember to restore the original when watching other videos).
Before the video stream hits the SVP plugin, you want to upscale it to a resolution that SVP can actually fill its interpolated data into (such as 1280x720 or 1920x1080; for 1080p, remember to change the pel value to 2, the overlap value to 2 and the block size to 16 in the override.js file), using a cubic filter such as Robidoux, Robidoux Sharp or Catmull-Rom (which won't cause such an immense amount of ringing). After SVP, you can then scale the final image down to your native display resolution (preferably using MadVR with a cubic or Catmull-Rom scaler).
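For reference, and assuming the same override.js format shown earlier in the thread, the three values I mentioned would look like this for the 1080p case (a sketch of my suggestion, not the full attached file):

```
/***** SVSuper options *****/

levels.pel                 = 2;

/***** SVAnalyse options *****/

analyse.block.w            = 16;
analyse.block.h            = 16;
analyse.block.overlap      = 2;
```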

If your PC can't handle this in real time, first try running a small sample of a video through this procedure with Avisynth and VirtualDub (or something similar) to check whether it fixes your problem. Then you can start lowering the quality of my suggested values (start with pel, overlap and block size, while keeping an eye on the effect on your video) until you can run it in real time.

(Ps. if you want, you can upload a sample somewhere and I'll run it on my end and send you some screenshots of areas with problematic motion comparing the default settings with my suggestions (unfortunately I can't upload any videos as I was born, and still live, in South Africa and I don't have a pigeon to send to you.))

(Continued from above)

Then these images approximately show the different 'style' of interpolated (simple frame doubling) frames produced by the two SVPs:

SVP 3.1.7:
[attached image: SVP 3.1.7 interpolated frame]

SVP 4:
[attached image: SVP 4 interpolated frame]

Basically, SVP 4 produces 'jumpy' images (like watching a 24fps video on a 48Hz el-cheapo LCD with very slow pixel response times of >20ms), while SVP 3.1.7 can be configured to produce more 'wavy' artifacts but actually moves the objects in the frame, producing a new frame with the objects in between their respective locations in the previous and next frames (which I find to be the whole point of motion interpolation, but some {most?} people seem to prefer the SVP 4 way).

This does, however, only affect difficult motion areas. In places without much temporal aliasing to begin with, both SVPs perform very well and generate a proper interpolated image.

SVP 3.1.7 can be configured to provide the exact same performance and "quality" as SVP 4, but the reverse is not true.
SVP 4's current settings are optimised for low-end hardware and produce temporally aliased images, which you may or may not like.
So, if you have old/slow/low-power hardware, you won't be able to use the advanced interpolation present in SVP 3.1.7 anyway, so using SVP 4 would not make you lose anything. However, if you have reasonably recent hardware (with a CPU TDP > 80W) AND you like the 'smoother' (read: reduced temporal aliasing) images produced by a 'high-end' configuration of 3.1.7 (see the comparison below), then you should stay with SVP 3.1.7 (at least for now).

To give you an idea of the differences, I have created these simple high-contrast 2D images (to reduce the attachment size and better illustrate an exaggerated example of the differences). Imagine this simple translational movement as representing a difficult motion to interpolate (big displacement, occluded pixels, non-uniform lighting, etc), from image a to image b:

Image a:
[attached image: object in its starting position]

Image b:
[attached image: object in its final position]

Ps. Continue on next post because of 3-attachment limit (need 4 images to display the difference)

Blackfyre
I really didn't mean to offend anyone, and I certainly don't see myself as particularly 'enlightened' (although we EEs seem to always come across that way, and for that I do humbly apologize).
My post was actually not aimed at you at all (indeed, I understood and agreed with what you said), I was just trying to clarify what we are talking about, since it seemed like not all of us were talking about the same thing at all. However, I do tend to lose sight of the big picture, which is that any SVP is way better than no SVP ( hmm trying to remember a time before SVP).

Bong34
You said you were "getting a new setup"; if that entails also buying a new GPU, I would definitely recommend waiting for the new 14nm GPUs to be released, at least to see whether they got Samsung (or maybe even GloFo...) to do a special 'high-power' node for their products (all current 14nm processes, excluding Intel's, are either 'very-low-performance' or 'low-performance' nodes for small, mobile-device SoCs). The reason I suggest waiting (which I wouldn't normally do, since there is always 'something new' on the horizon) is that a GPU built on a dedicated 14nm HP node could easily double the performance of even the fastest Titan X card in the same 250W package. If, however, they have to make do with an LPE or LPP node, then we can expect (at best) a 50% performance increase in the same 250W package.

Either way, I doubt it will make any difference to SVP's performance (since SVP doesn't actually do all that much on the GPU), but thought it relevant since you mentioned gaming performance as another priority.

mashingan
I certainly understand that the 'best' settings do not simply mean 'higher numbers must be better'. Also, anime content requires much different settings (and performance) than 'filmed' series or movies, which is why I said something quite similar to what you pointed out in an earlier post I made to another thread.

Sure, a 6700k is probably "enough" for most people's anime interpolation preferences, but that is only because it rarely contains any mathematically coherent motion to actually interpolate.

For all other use cases, no CPU & GPU combination currently exists that can deliver anything close to 'good enough' performance. Then again, some people like their interpolated frames to still contain aliased motion (aka. the whole 'film look' vs. 'soap opera effect' debacle) and SVP4 makes achieving that with very few other artifacts quite easy with very modest computational requirements.

Nintendo Maniac 64 wrote:

...I do not understand why some are saying the 6700K is a better choice.

This is actually a pretty complicated issue, but the simple answer is a combination of:

1) SVP is not very well threaded. More accurately, the MVTools library running on Avisynth has a badly hacked form of threading support (the optimal number of threads for maximum SVP performance on a 4-core 4790k is 24) that does not scale very far (hard 32-bit memory limits cap my 8-core 5960X at threads = 26) and carries a pretty high threading overhead, which leads to a fairly bad case of Amdahl's law.

2) Skylake has a ~10% IPC advantage over Haswell (properly vectorised AVX2 code sees a much larger increase, almost 50% over Haswell's AVX2 code execution, but no one writes code like that these days), and the 6700k overclocks (on average) another 5-10% further. Together with the faster caches and memory (assuming decent DDR4, since you have to buy new RAM anyway), I would put the 6700k's net per-core advantage at around 20%, which is very close to the real-world benefit of 50% more cores.
(Having an 8-core 5960X, I can testify that doubling the cores from the 'standard' i7 4700-series makes absolutely no difference in 99% of applications and games; if I had the choice between a 32-core CPU @ 4GHz and one with the same architecture but only 4 cores at 6GHz, I would jump on the latter in an instant.)

However, if you already have a 5820k, getting a 6700k will not improve your performance in SVP; they should perform very similarly if both have been given a solid overclock. It probably will improve your performance in most other applications a bit, but don't expect a constant 20% improvement in all single-threaded apps either. That is the real reason for recommending the 6700k over the 5820k, not because the 6700k will be much faster in SVP.
Also, using the IGP in any way would be a very bad idea. Combining an otherwise equal CPU and GPU onto one die just serves to dramatically reduce the performance of both. If Intel could replace all that wasted die space with some more cache and expanded speculative-execution resources, then we would probably have an instant 50~60% jump in IPC, but they'd never do that... more useless cores market way better.

Chainik wrote:

Blackfyre
I still cannot max out all the settings

good for you  big_smile

A simple statement that keeps floating around, but one that does not really explain the problem that well. 'Maxing' the settings in SVP's GUI by simply pushing all the blue bars as far right as they will go does nothing but make most people's PCs drop frames (and display severely artifacted ones in many other cases).

As I (and indeed all the developers as well) have said before, this is why there are so many reports of SVP 4 looking so much better while requiring fewer CPU resources to do so (apart from most people simply preferring the 'sharp' look of temporal aliasing). If people don't educate themselves about the algorithm being implemented, by actually reading the source code, seeing where those variables go and understanding how they mathematically alter the accuracy of the extracted motion vectors in different situations, then how are they supposed to set the 'best' values for their own system?

Jeff R 1 wrote:

Blackfyre _  However for the last 6 or so months I have been running @ 4.7Ghz constant & cooled (Note: I still cannot max out all the settings).

I wonder just how much it would take to max out the settings _ are you using and GPU at all ?

Well, I can give you an idea if you'd like. My system specs are in my profile (a 5960X @ 4.7GHz, SLI Titan Xs @ 1.3GHz and 128GB RAM), and my current setup is to encode my video and watch it later, since I get about 0.8~1fps. Getting even that speed requires splitting my video file into halves and encoding each with 4 cores and one of the Titan Xs in a separate VirtualDub instance (this explicit parallelism almost doubles the processing framerate).
I believe the SVP settings that I am using for high-quality Blu-ray playback are about as far as the current SVP libraries can be pushed with regard to maximum SSIM (what I optimise for; aka the subjective part) for good, noise- and artifact-free sources in a 32-bit process.
So, to answer your question: to 'max out' SSIM with the current SVP libraries in real time (assuming 24->60fps) would require a system with around two orders of magnitude more processing power than mine, without resorting to more cores or more GPUs (since the split-file hack can't be used in real time), which is equivalent to around four times(!) the performance increase we got while clock speeds were still being raised, between 1985 and 2005.

Theoretically, this could be done within the next decade, but it would require a complete rewrite of the MVTools library: explicit multi-threading and advanced AVX-512 inline ASM optimisations for all data-processing functions, to efficiently offload the data to a CUDA (or, I suppose, OpenCL) implementation of the actual MV-searching and -determining functions. If you have taken a look at the MVTools library, you'll see that is a task that would probably cost a few million dollars in software development (although with these Russian coders you never know... they have this astonishing way of just getting stuff done with far fewer resources than most people thought would be required).

dlr5668 wrote:

Bong34
Both are more than enough for SVP.

No, this is simply untrue; however, the 6700K would indeed be the much better choice of the two.

Even if we redefine 'enough' as simply meaning that more computational performance does not lead to any perceptible increase in interpolation quality (instead of the 'correct' meaning of the interpolated frames being mathematically identical to a non-temporally-aliased native high-framerate recording), the statement remains false. There is no single CPU, nor any multi-CPU server, that is 'enough' for SVP (and no, that is not only for 4K; I am talking about 1080p).

After I upgraded to my current 5960X-based system, I saw a very noticeable increase in image quality (still not what I would call 'good', but I don't think it is reasonable to expect any interpolation software to generate a good picture from a source as heavily aliased as a 24fps recording).
After I overclocked said system to 4.7GHz, the quality did not increase, simply because SVP was not developed with such systems in mind.
However, redefining the values in the override.js file did in fact significantly improve the quality of SVP's interpolated images, while running at around 80~90% CPU load on average. Simply put, my 5960X is not up to the task of providing enough performance for a proper interpolation (at 1080p; I don't even try 4K).

So while neither SVP3 nor SVP4 can make use of increased CPU or GPU power at their default settings, editing the configuration files allows SVP to make full use of any CPU you can give it, with an accompanying, massively improved image quality.
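As an illustration, here is an override.js fragment in the same style as the SVAnalyse options quoted earlier in this thread; the values are examples that work for me, not universal recommendations:

```javascript
/***** SVAnalyse options (override.js) *****/
// Remove the leading '//' for a line to take effect.
analyse.vectors       = 3;
analyse.block.w       = 16; // block width in pixels
analyse.block.h       = 16; // block height in pixels
analyse.block.overlap = 3;  // overlap = 3 gives noticeably fewer artifacts
```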

I am in the process of thoroughly testing and documenting the effect that these 'hidden' settings have on the quality and CPU load of SVP's interpolated output (similar to what has already been done on other forums to compare image-upscaling algorithms).
It is slow and tedious work: I am using the lossless SVT_MultiFormat sources, which are 48MB per frame (more than 20GB for a 10s native 50fps video), and SVP almost always runs at less than 1fps. But I hope to be able to demonstrate to the developers the significant improvements in quality (with proper mathematical similarity metrics to back it up) that allowing these heavier settings brings.
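For reference, the kind of number being optimised can be sketched in a few lines of Python. This is a simplified, single-window SSIM over whole frames (real SSIM averages the same formula over local sliding windows); it is an illustration of the metric, not the exact tooling I use:

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Simplified SSIM computed from whole-frame statistics.

    Real SSIM averages the same formula over local sliding windows;
    this single-window version just illustrates the metric.
    """
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2   # stabiliser for the luminance term
    c2 = (0.03 * data_range) ** 2   # stabiliser for the contrast term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# An interpolated frame is scored against the matching native
# high-framerate frame; identical frames score exactly 1.0.
frame = np.arange(64, dtype=np.uint8).reshape(8, 8)
print(round(global_ssim(frame, frame), 6))  # 1.0
```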

Could be a couple of things...
Please post your SVP settings (and any alterations to the override.js file) and also let us know if you use madVR. There may be some misconfigured settings (such as not forcing SVP to use GPU_21, or something in madVR's seeking and presentation queue settings) that are making SVP slow to catch up.

Also, have you: tried having your notebook plugged in and fully charged while playing, with Windows' power profile set to 'Maximum Performance'; and configured the NVIDIA Control Panel to always use the discrete GPU and to 'always prefer maximum performance' (at least for the MPC-HC profile)?

Ah right, I was afraid of that.

Well, maybe the Avisynth Virtual File System can be used as a possible workaround?
From their website: "AVFS is a virtual file system that exposes the output of Avisynth scripts through the file system as a set of virtual media files. This allows Avisynth to feed media applications and converters that do not use the VFW API."

Mystery wrote:
Chainik wrote:

Windows 10 update breaks .avs file association

That explains it.

How do you work around this?

Upgrade to Windows 7?

However, Avisynth does not work by altering the default program associated with .avs files, and neither does SVP require any particular default program association to function. Think about it: ANY video player can open .avs files as valid video files; they cannot all possibly be the default associated program for .avs files.
Incidentally, you can test this quite easily on Windows 7 (don't know about 8, 8.1, 10) by right-clicking on any .avs file and selecting "open with" and then "choose default program". In the popup window that appears, choose any simple text editor such as Notepad, Wordpad, Notepad++, AVSP, etc.
Double-clicking the .avs file will then open it not as a video but in the chosen text editor. However, if you load SVP and play a video with a correctly configured MPC-HC (I have not tested other players, but it should work the same), SVP will still correctly load its script into the ffdshow Avisynth plugin and start smoothing your video.

Unfortunately, I do not know how Windows 10 breaks the processing chain (I do know that DirectShow has been deprecated since the Vista days; however, as bad as still using DirectShow is, I very much doubt that Microsoft would have just killed off all DirectShow processing without so much as a word...), but .avs file associations seem like a very unlikely cause.