26

(4 replies, posted in Using SVP)

Chainik wrote:

It's a video driver's job to deal with sli

Only for SLI-based frame rendering though...

I was under the impression that SVP uses OpenCL to run calculations on the GPU in addition to the frame composition and rendering?

Are you saying that the GPU is only really used to render the video frame in the same way that 3D games are rendered (and so can be sped up with the development of a dedicated AFR SLI profile?). If so, maybe we should all petition NVIDIA (and maybe also AMD?) to develop such a profile.

With regards to multi-GPU compute, however, even the additional abstraction and unified memory model of CUDA would not save your application from needing to be specifically designed around algorithms that are explicitly aware of where they run and how data is passed between them and the main system. The driver cannot possibly abstract that away, especially when the code in question is OpenCL.
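To illustrate the point, here is a minimal pyopencl sketch (assuming the pyopencl bindings and an OpenCL runtime are installed; this is not how SVP itself is structured): even just enumerating two GPUs and queueing work on them is something the host application has to do explicitly.

```python
# Minimal pyopencl sketch: the driver exposes each GPU as a separate device;
# splitting work between them is entirely the host application's job.
import pyopencl as cl

platform = cl.get_platforms()[0]
gpus = platform.get_devices(device_type=cl.device_type.GPU)

ctx = cl.Context(gpus)
# One command queue per GPU -- the application decides which frames (and
# which buffers) go to which queue, and when to synchronise them.
queues = [cl.CommandQueue(ctx, device=gpu) for gpu in gpus]
print(f"{len(gpus)} GPU device(s); work splitting is up to the host code")
```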

Kondekka wrote:

The SVP 4 seems to be running like a charm at highest settings, the 3.1.7 had some trouble and made my pc a toaster

Yes, I have also noticed that...
I am glad that you found a worthwhile improvement even with such an early build of SVP4, but your comment perfectly illustrates a big problem that I have with the 'new and improved' SVP4. Your PC is having such an easy time with the new 'best' settings because it is not doing the same amount of work.

As far as I understand it, SVP4 is being developed not to improve on the best possible quality or speed achievable with the 'old' 3.1.7, but rather to help more people get the 'best' experience that their hardware is capable of. MOST people (everyone except me? lol) seem to prefer a more film-like interpolation over 'halo' artifacts and blurry motion. The upshot is that you can achieve that effect with LESS computational power than you would need for the 'blurry' image with artifacts.

27

(36 replies, posted in Using SVP)

Mystery wrote:

If there are changes to the engine, it's kind of hard to compare apples to apples because the way configuration settings work is completely different.

Hmm, I actually think that the comparison would be very fair,
since both use block matching for motion vector determination,
both calculate the same error metric (SAD or SATD) to pick out the good MVs,
and both rely on 'brightness constancy' and a maximum vector field divergence (i.e. a 'smoothness') constraint.

Although the processing performance and level of optimisations will no doubt be quite different, the output images should still have the same qualities and (somewhat masked) artifacts as well. Also, both should be able to support the same types of settings and arguments (some might be hardcoded in SVP4 for 'user friendliness' though).
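To make that concrete, here is a deliberately naive Python/numpy sketch of the exhaustive SAD block search that both engines build upon (SVP's real search is far more elaborate, with coarse-to-fine refinement, penalties and hand-written SIMD code):

```python
import numpy as np

def sad(a, b):
    """Sum of Absolute Differences: the cheapest block-matching error metric."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def best_vector(prev, curr, bx, by, block=16, radius=7):
    """Exhaustive search for the motion vector of the block at (bx, by):
    compare the current block against every candidate position within
    'radius' pixels in the previous frame and keep the lowest SAD."""
    ref = curr[by:by + block, bx:bx + block]
    best_err, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= prev.shape[0] - block and 0 <= x <= prev.shape[1] - block:
                err = sad(prev[y:y + block, x:x + block], ref)
                if best_err is None or err < best_err:
                    best_err, best_mv = err, (dx, dy)
    return best_mv, best_err
```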

28

(8 replies, posted in Using SVP)

Jeff R 1
More masking is normally not a good idea, because even though it does 'cover up' some artifacts, it also makes the motion much less smooth (which is the whole reason we are using SVP, no? tongue ).

Anyway, I do agree that a user-controlled option (even a normally hidden one) would be the best solution, but it seems like the main (only?) goal of the new SVP4 is 'user-friendliness' (hardcode most settings and hide the rest, while also incorporating computationally cheap and common image processing functions) and making it run on other operating systems while maintaining 'good-enough' (good enough for whom, though?) performance on old/cheap laptops.

While I may not like or support these development priorities, I can at least understand the rationale behind them, both as a means of generating more donations (and increasing the final product's sales among the 'clueless masses') and maybe even as a way to promote the virtues of high-framerate videos and movies and dispel the 'soap-opera effect' myth.

Also, those sharp (high spatial frequency) artifacts can unfortunately not be completely eliminated by any combination of mask settings while still getting a smoothly interpolated result for the same masked pixels.

The main reasons for the creation of these artifacts in the first place (and why they are not considered fixable bugs) are two-fold:

Firstly, SVP uses block-based matching to determine the relevant motion vectors so that all SSE2 CPUs, back to even the very first Pentium 4 chips that are more than 14 years old, would be able to run it.

Secondly, the 'Aperture Problem', combined with temporal aliasing, requires that SVP's interpolation algorithm rely on some axioms regarding the motion vector field; these axioms are most commonly derived from the 'brightness constancy' constraint and the required 'smoothness' of the motion field.
In your case, the beam of light changed the luma of the pixels 'under' it much more than their chroma values, thereby violating the constant-brightness axiom.
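In symbols (my notation, not anything from SVP's docs), brightness constancy and the linearised optical flow equation it leads to are:

```latex
I(x, y, t) = I(x + u\,\Delta t,\; y + v\,\Delta t,\; t + \Delta t)
\qquad\Longrightarrow\qquad
I_x\,u + I_y\,v + I_t = 0
```

A light beam sweeping over a surface changes I_t without any real motion, so no motion vector (u, v) can satisfy the equation there, which is exactly why those pixels produce artifacts.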

However, a bit of good news (even if the final SVP4 version ships without any of the current 'advanced' user-controllable algorithm values): if you are prepared to give up some smoothness to reduce artifacts in 'problematic' areas such as those, then using SVP3 with some tweaked 'override' parameters (and a more powerful PC) will probably produce an image that is as artifact-free as what the final version of SVP4 would produce.

29

(37 replies, posted in Using SVP)

No, I was trying to explain the difference between the two, but I was talking about the single pixel's response time being artificially made 100-1000x slower than the sub-0.01ms that a single non-AM OLED pixel would be capable of, completely independent of the panel's refresh rate.

But yes, I am waaay off topic here; will look into that site you mentioned, thanx.

30

(37 replies, posted in Using SVP)

Nintendo Maniac 64
Unfortunately, no.
I do not know how much circuit theory you know, but what I tried to explain was that ALL current commercial OLED panels use an addressing technique which causes the pixels to respond sluggishly.

Ok, let me explain it like this: you know how a CRT TV has a scanning electron beam that lights up the display pixel-by-pixel, one row at a time? Even though it would be possible (in theory) to update the pixels on a "flatscreen" display (LCD or OLED are similar in this regard) all at once, this is never done in practice.
All displays still use the same scheme of refreshing the pixels one line at a time. And while you are correct in the sense that this scanning behaviour CAN be independent of the usual 16.67ms frame period, it VERY RARELY is, because manufacturers want the pixels to be on for as long as possible to put out the most light (and to reduce the column and row drivers' operating frequencies).

What this means in practice is that all the pixels on screen need to be refreshed by the end of the 16.67ms frame window at the latest. This is done one row of pixels (called a scanline) at a time, so that each row has only frametime/number_of_rows (= 0.01543ms = 15.43 MICROseconds for a 1080p/60Hz display) of time to change its state.
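As a quick sanity check of that number (plain Python, no assumptions beyond the stated 60Hz, 1080-row display):

```python
frame_period = 1 / 60              # seconds per frame on a 60 Hz display
rows = 1080                        # scanlines on a 1080p panel
print(f"{frame_period / rows * 1e6:.2f} microseconds per scanline")  # -> 15.43
```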

In the BEST CASE, the display updates an entire row of pixels all at once, but this is very expensive and thus also very rare.
The usual practice is for each column driver IC to update a single pixel at a time until it has updated all the pixel lines attached to it. Since there are between 2 and 8 of these chips in an average display (1-2 in an expensive, thinner model), only that many pixels get updated at the same time (reducing the actual per-pixel addressing time by a factor of between 10 and 100).

For the sake of a best-case argument, let's say each pixel gets a full 15.43us of addressable time.
Up to this point, I have not even considered the actual pixels' RESPONSE times (display lag is another thing entirely and has to do with frame buffering and digital electronics; analog processing only adds a variable 0-16.67ms of lag on top because the lowest rows of pixels get refreshed much later than the uppermost ones).
This way of addressing didn't have that big of an impact with the slow nature of LCD pixel response times, but it has a HUGE impact when the actual pixel CAN transition as quickly as OLED is able to.
THIS IS THE IMPORTANT PART:
Since each pixel is only connected to an electric supply for 15.43us of each frame (best case with a 1080p display, although many OLED displays are 4K), the display driver deposits a huge pulse of electricity into the pixel's storage capacitor during that time. The capacitor then slowly discharges through the pixel, keeping it at roughly the correct value until the next frame refresh comes along. It is this sluggish capacitor response that limits the pixel's response time. Even though the OLED pixel on its own CAN switch in under 0.01ms like you said, it takes much longer for the capacitor to stabilise at the correct value.

Response time compensation techniques try to enhance pixel response by charging the capacitor with a higher voltage than the current transition requires, which causes a faster transition (the capacitor's charging current equals its capacitance times the rate of change of the voltage across it, so a larger applied voltage difference drives a faster voltage change) but also results in RTC overshoot artifacts.
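Here is a toy Python model of that behaviour (all constants are made up; real panels differ), showing why per-frame charge pulses plus a storage capacitor yield a multi-frame settling curve, and how overdrive shortens it:

```python
# Toy model, made-up numbers: the storage capacitor only sees the driver
# during a ~15 us addressing window each frame, so per refresh it covers
# only a fraction 'alpha' of the remaining voltage swing.
alpha = 0.5  # assumed fraction of the remaining swing covered per refresh

def refresh(v, v_drive):
    """One frame's addressing pulse: move part-way towards the driven level."""
    return v + alpha * (v_drive - v)

v = 0.0
for frame in range(1, 5):          # plain drive straight at the target (1.0)
    v = refresh(v, 1.0)
    print(f"frame {frame}: at {v:.2f} of target")   # 0.50, 0.75, 0.88, 0.94

v = refresh(0.0, 2.0)              # overdrive: aim past the target for one frame
print(f"overdriven frame 1: at {v:.2f} of target")  # ~1.00, but misjudge it and you overshoot
```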

Hopefully someone will make an OLED panel that works like the old plasma display panels used to (no charging capacitor = almost instant pixel responses), but until that time, I suppose we are stuck with what we have...

Hopefully I have better explained how it all fits together this time around, but please do let me know if you have any other questions / see any errors.

sausuke
Have you tried uninstalling and completely deleting all old video player / CODEC / Avisynth folders and reinstalling the complete package from the SVP homepage?

Also, you could try MPC-HC instead of Potplayer?
MPC-HC usually performs better and is more stable for me.

32

(37 replies, posted in Using SVP)

Nintendo Maniac 64
Thank you for your detailed answer.
Yes I know about the OLED TVs being sold, but they have poorer pixel response times than even my 144Hz TN-TFT LCD, due to the AM addressing scheme. I do not want to derail this thread any further, but I feel I owe you at least a basic answer. Basically, what happens is that they are addressing each pixel line-by-line, the same way you would a normal TFT LCD.

Each pixel can only be controlled for a small fraction of the frametime: t = column_drivers / (fps * display_width * display_height * num_subpixels), where column_drivers is the number of column driver outputs that can be active at the same time, fps is the refresh rate, and display_width and display_height are the display's pixel dimensions.
This limited time during which each pixel can be actively powered would result in an unacceptably dim screen (consumers flock to extremely bright displays, since they look more vibrant under the extreme showroom lights). Thus, a capacitor is included with each pixel to store enough charge to let the pixel continue emitting light long after the addressing pulse has moved on.

To also ensure that the display is thin, light and does not draw a lot of power, the driving electronics are limited to a small maximum current (since conductive losses and heat production scale quadratically with current).

The pixel response time is then set by the amount of charge that has to be moved to or from the capacitor, divided by the maximum current that the column driver can output. In almost all modern displays, this limited output current, combined with the limited active pixel addressing time, prevents most of the pixels from reaching their target brightness within a single frame time.
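Plugging illustrative numbers into that formula (every value below is an assumption for the sake of the example, not a measurement of any real panel):

```python
# Illustrative numbers only -- not measurements of any real panel
column_drivers = 4                 # simultaneously active column-driver outputs
fps, width, height, subpx = 60, 1920, 1080, 3

t_address = column_drivers / (fps * width * height * subpx)
print(f"active drive time per subpixel: {t_address * 1e9:.1f} ns")  # ~10.7 ns

delta_q = 2e-9                     # assumed charge for a black-to-white swing (C)
i_max = 5e-6                       # assumed maximum driver output current (A)
print(f"charge-limited settling time: {delta_q / i_max * 1e3:.2f} ms")  # 0.40 ms
```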

Overdrive techniques try to compensate by 'overdriving' the pixel on the first frame and then gradually settling down to the final value; however, the effectiveness of pixel overdrive is limited by the manufacturing tolerances of the energy-storing capacitors (which affect the pixel's time constant) as well as by the voltage limits of the capacitors, transistors and the pixels themselves (a full-range transition, i.e. black to white, can usually not be overdriven, since the final, settled voltage is already required to be at the maximum possible value).

I hope my explanation is sufficient to understand what I meant about Active Matrix OLED displays not really improving motion resolution as of yet.
Of course, in the future, someone could construct a commercial AM OLED display with sub-1ms response times, but that would require thick, 'ugly?' heatsinks, high framerates and a much lower maximum luminance. Unfortunately, there does not seem to be any indication that there will ever be sufficient consumer interest for those trade-offs to make economic sense. sad

Hmm, could you please list your filter chain, your PC specs and your SVP settings?

Failing that, I have a few suggestions that you could try:
On my G-Sync display, having G-Sync enabled caused a constant amount of frames being dropped, regardless of CPU/GPU usage.
Even if you don't have G-Sync, you could try changing your display driver's profile settings for the mpc-hc.exe file, and setting the global default to use your custom settings. Specifically, you could try disabling all power-saving features (setting 'prefer maximum performance') and also try different ways of handling v-sync.
For v-sync, you should have the application (mpc-hc) handle it and not have any value forced in the display driver's control panel. You could also try manually adding dwm.exe (in C:\Windows\system32\) and forcing 'prefer maximum performance' as well as forcing v-sync 'off'. This could have an effect since I believe mpc-hc, by default, defers v-sync handling to the dwm.

I would also suggest that you use madVR as your renderer, even if you do not use any of the advanced scaling features, simply because you can then very accurately specify the frame buffering and drawing behaviour, and you can also press CONTROL + J to bring up the performance statistics.
If these stats show that you are dropping small numbers of frames, even though your maximum frametimes over the 5 second window are well below the refresh frametime of your display, then it is almost certainly some sort of v-sync or other frame drawing / framerate matching problem and not one of purely insufficient performance.

Edit: Also, if you are using ReClock, be sure to specify the v-sync location at either the very top or the very bottom of the screen and to enable script notifications.

34

(37 replies, posted in Using SVP)

Nintendo Maniac 64
Or you can just use a big water pump and 3x480 radiators that you mount outside of your media room to get the best quality and no noise. smile
Doesn't really solve the bang per buck issue though...

I also wanted to ask you about your monitor description of a few posts back:
Are you using a CRT because of its superior motion resolution at high frame rates?
I too am very sensitive to the motion abnormalities of 'modern' displays; unfortunately, I also have a stupidly high flicker fusion threshold (CRT flicker, even at 120Hz, becomes completely unbearable after 30 minutes or so; also, I simply cannot resolve even a still picture at 60Hz on account of the strobing, something that my friends had a lot of fun with while CRTs were still commonplace...), so strobing really does nothing to solve the problem for me either.

Unfortunately, it seems that all of the OLED panel manufacturers will only be implementing 'Active Matrix' panels due to their lower power consumption and slim profiles, quality be damned.
Is there some specific reason why you believe that AM-OLED displays (not just the Samsung-specific AMOLED, lol) will do anything to change the quality of motion resolution on modern displays?

MistahBonzai wrote:

I could use some help finding where I can set overlap: to 0.

Why would you EVER want to do that? hmm
Even if your PC can't handle 4K without it, downsampling a 4K video to 1920x1080 with something sharp (i.e. Lanczos4) and letting SVP interpolate that with block overlapping set to 1/4 (or, preferably, 1/2), then finally upsampling it back to 4K with the same algorithm, will yield a much smoother video (due to better noise rejection and motion vector cohesion). This will also produce far fewer artifacts, especially on a high-resolution 4K monitor.

The only problem with this method would be the extreme loss of chroma resolution, resulting in an epic case of chroma blocking (and probably a fair amount of chroma bleed as well), since SVP (or, more accurately, MVTools) will not accept anything but the lowest allowed chroma quality (4:2:0 subsampled).
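For anyone wanting to experiment with that chain, here is a rough VapourSynth sketch (the file name and source filter are placeholders, and the actual SVP/MVTools interpolation step is elided because its parameters depend on your setup):

```python
import vapoursynth as vs
core = vs.core

clip = core.ffms2.Source('movie_2160p.mkv')    # placeholder input file/filter
# Sharp (4-tap Lanczos) downscale to 1080p before interpolation
down = core.resize.Lanczos(clip, 1920, 1080, filter_param_a=4)

interp = down  # ...the SVP / MVTools frame interpolation step goes here...

# Upscale the interpolated result back to 4K with the same kernel
up = core.resize.Lanczos(interp, 3840, 2160, filter_param_a=4)
up.set_output()
```

As an aside, a 4K 4:2:0 source already stores its chroma planes at 1920x1080, which is the arithmetic behind the 4:4:4 remark quoted below.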

Nintendo Maniac 64 wrote:

If you're downscaling to 1080p, then make absolute sure you're getting 4:4:4 chroma since being able to get a full chroma resolution is one of the benefits of downsizing 4k video to 1080p.

How do you get full 4:4:4 chroma through SVP?
Or did you mean downscaling after SVP has processed the 4:2:0 4k clip?

36

(4 replies, posted in Using SVP)

In combination with what dlr5668 suggested, you could also try lowering your "Processing Threads" setting (it will reduce performance, but also results in much better stability if you are near the 2/3/4GB memory limit).

Another thing to try in combination with the 4gb_patch is to reduce your "SetMemoryMax" to 768 (512 may or may not work/help).

37

(45 replies, posted in Using SVP)

Chainik
Please correct me if I misunderstood, but surely that search radius, combined with how far the block has moved, also has an impact on CPU usage?
For example, if a block moves further than the current search radius, then a suitable motion vector will not be found and SVP will have to do an additional "refined" search (if the "bad SAD" limit is exceeded), correct?

Ps. When only dark scenes have been displayed for a number of consecutive frames, does SVP recalculate its limit for what it considers to be "detailed and bright" areas (i.e. as a fraction of the maximum brightness of a block in the current scene), or does it continue to consider the entire frame's blocks to be 'unimportant' or 'less visible'?

38

(45 replies, posted in Using SVP)

dlr5668
No problem, I'm glad I could contribute something.
And thanks for reading through all of it, I just wanted to ensure people know WHY we change certain settings as opposed to just "higher numbers must be better".  smile

39

(45 replies, posted in Using SVP)

dlr5668

TL;DR: skip to the end for my settings.

Well first I'd like to give the reasoning behind my settings. Anime content poses some unique challenges; specifically, only panning scenes or computer-generated content actually have a solid mathematical relationship (where motion is concerned) between past and future frames. The rest of the content is normally hand-drawn at some fraction of the video's frame rate (i.e. only drawing 6 or 12 changed frames for every 24 frames of motion).
Since SVP uses a block-matching algorithm (and artists re-drawing a character multiple times to create the illusion of motion rarely draw an exact 3D projection every time), very few blocks will match up close to perfectly, unlike the case with 'normal' video. Furthermore, the stop-start nature of the motion (caused by drawing fewer moving frames than the frame rate requires) also makes it hard for SVP to calculate the true motion, since as far as it is concerned, that is how the motion is supposed to be. SVP has no way of knowing whether two identical frames are the result of a production being too cheap to even draw 24fps motion, or are simply signaling that the moving object has come to a halt. This leaves us with greatly increased artifacts (mostly in the form of solid lines becoming 'wavy', as shown in the SVP Wiki) or with the motion just not being any smoother, even for slowly moving or rotating objects.
Yes, in theory one could write some sort of algorithm to check for and remove the superfluous frames (while leaving legitimately motion-free portions alone) and then use SVP frame doubling to interpolate the resulting variable frame rate video, but that is way beyond the scope of dlr5668's question...
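Just to make the idea concrete, a naive version of that duplicate-removal step might look like this (a sketch only; the threshold is a guess and real footage would need something considerably smarter):

```python
import numpy as np

def unique_frame_indices(frames, threshold=0.5):
    """Return indices of frames that differ 'enough' from the last kept one.
    'frames' is an iterable of 8-bit grayscale numpy arrays."""
    kept, prev = [], None
    for i, frame in enumerate(frames):
        if prev is None or np.abs(frame.astype(np.int16) - prev).mean() > threshold:
            kept.append(i)
            prev = frame.astype(np.int16)
    return kept  # feed these (with their timestamps) to a VFR interpolation pass
```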

As for the SVP settings, we have one of two choices:
1) We can only smooth reliable, global panning motion (produces fewer artifacts but does almost nothing to moving objects or characters)
2) Live with the increased artifacts and try to tweak SVP's settings as best we can for each specific anime's drawing style.

I prefer option 2 (but you can easily adapt my settings for 1 by using a very large 'motion vectors grid' setting, disabling 'decrease grid step' and setting motion vectors precision to one or two pixels).
These are my settings for MOST anime content:

Profile Settings:
Processing Threads: 24 (use fewer for better stability at some performance cost, or on quad-core CPUs)
GPU-acceleration: my second GPU (the one not used by madVR)
Frames Interpolation Mode: Adaptive (or "2m", depending on the severity of motion artifacts)
SVP shader: 2. Sharp
Target Frame Rate: Source multiplied by 3 (use "To screen refresh rate" for 60Hz monitors)
Motion vectors grid: 14px. Average 1 (for 720p: 8px. Small 9)
Decrease grid step: By two with global refinement
Search radius: Large
Motion vectors precision: One pixel (for 720p: Half pixel)
Wide search: Strongest
Artifacts masking: Weak
Processing of scene changes: Repeat frame
Decrease frame size: Disabled (this setting makes very little sense to me, I suppose it is for compatibility with very slow PCs?)

The following settings are best left at the profile defaults unless you have read through the svpflow documentation and the SVP MVTools2 documentation and understand what they do (they can decrease both quality and performance).

Hidden settings: (SVPMgr.ini, override.js, etc. Very content dependent)
SetMemoryMax(768)
analyse.main.search.distance = -16 or -32
analyse.main.search.satd = true (better quality but requires much more performance)
analyse.main.search.coarse.width = 1080 (set this >= your video width)
analyse.main.search.coarse.distance = -16 or -32
analyse.main.search.coarse.trymany = (depends, can improve or worsen performance)
analyse.main.search.coarse.bad.sad = 1000 (750~1250)
analyse.main.search.coarse.bad.range = -32 or -64
analyse.main.penalty.lsad = 4000 (4000~8000)
analyse.main.penalty.pnew = 40
analyse.main.penalty.pglobal = 40
analyse.main.penalty.pzero = 75
analyse.refine[0].thsad = 300
analyse.refine[0].penalty.lsad = 6000

40

(45 replies, posted in Using SVP)

Mystery wrote:

With 3.1.6, I was still using updated libraries so upgrading to the latest package didn't give improvements in itself.

That actually explains it. If I change the libraries to the new ones then I also get identical performance between 3.1.6 and 3.1.7, which makes sense since I believe all performance improvements should come from the dll libraries (the SVP manager should not use any substantial amount of resources). Also, if there were any architecture-specific improvements, I believe that Mystery's Ivy Bridge-based processor would also have benefitted from them, as Ivy Bridge also supports the AVX2 instruction set.

I also downloaded and tested that YouTube video after I disabled 4 CPU cores in the BIOS and backed off the "threads" setting to better approximate more "mainstream" systems, and I still seem to get around a 10% performance improvement (lower total rendering time) with the new libraries.

I think that the reduced performance for some scenes can best be explained by SVP having to do more work to find suitable motion vectors when there are many different moving objects having large motion displacements, as compared to still or slowly moving frames.
If I change my SVP settings to those more suited to interpolating Anime content, SVP's performance and quality on that video also improves quite markedly.

Ps. You can tweak this scene-dependent adaptive behaviour by changing the "badSAD" values and altering how far and with what algorithm SVP should search for replacement motion vectors; however, the default values seem to be optimal for most mainstream 4-core machines and videos.

41

(45 replies, posted in Using SVP)

Lol, that might be so, but I was actually trying to figure out why SVP 3.1.7 is not universally faster than 3.1.6 for Mystery.
I can understand my system being able to handle higher settings than his, but 3.1.7 should also be faster for him shouldn't it?

42

(45 replies, posted in Using SVP)

Nintendo Maniac 64 wrote:

But what is your exact CPU model?

Intel 5960X @ 4.6GHz (I also have it in the "System Specs" area of my forum profile), why do you ask?

Do you think the performance improvement is maybe only for newer processor architectures?
Seems kinda unlikely to me as we are still stuck with only SSE2 optimizations...

43

(8 replies, posted in Using SVP)

seanami
Maybe you could also try setting "processing threads" to 24 or 26.

The reason I am suggesting that is that I have the exact same CPU model as you (and it seems the same monitor as well, the Asus ROG Swift, no?), and 26 threads allows me to completely "max out" the normal profile settings without dropping frames, while lowering the number of threads results in reduced CPU utilization and the accompanying frame drops (if the quality settings are not lowered).
I do get the occasional script restart due to "memory leak detection" and sometimes MPC-HC crashes, but the increased interpolation accuracy is well worth it to me.

As always though, if you are happy with the current performance at 19 threads then it might not be the best idea to fix what is not broken... smile

Ps. I saw your madVR settings screenshot; have you ever tried increasing your "render queue" so that it is larger than your "present queue"? It might also help with stability and smoothness.

44

(45 replies, posted in Using SVP)

Hmm, that sounds really weird, since after I upgraded to 3.1.7 I can't seem to get more than 50% CPU utilization even if I completely max out all the normal menu options for 60Hz interpolation. Interpolating to 144Hz does "solve" the "issue" for me, but I wonder why you aren't seeing consistently lower CPU usage than with 3.1.6.

Would you mind posting your full system specifications and your SVP settings?
You can also try changing the "Index&CPUControl" value from 1 to 0 (under "Hidden settings") to see if it makes any difference, although I wouldn't recommend keeping it disabled for normal usage.

With regards to your Avisynth version, the SVP developers have (as I understand it) explained in some other threads that 2.5.8 "SVP edition" does indeed provide increased performance in the majority of cases. Using 2.6 could be making some existing problem worse?

Ah right, you figured it out while I was typing  big_smile
Have you tried playing back the same file with a clean reinstall of 3.1.6?

-Ps. Is there some way to merge double posts?

Hi there Mystery,

You can try to disable the "Auto crop black bars" feature under the "Frame crop" menu option. Also select "Disabled" in the same submenu and see if those two settings make a difference. Then check your logs and please report back if that "worked around" the issue.

For what it's worth, my crash logs also contain those "Frame # crop detected" parts, yet I seem to achieve stable playback after around 1-2 seconds. I'm just speculating here, but it might be that your video has some constantly changing border cropping that SVP is trying to make sense of?

Yes I agree, but the links Chainik posted have very little to do with what I have proposed.
The links he posted refer to the use of completely different algorithms than those I have linked to, specifically they use deep networks, while I proposed using continuous partial derivatives which (if you check the objective Middlebury benchmark) beat both block-based matching and trained neural networks on multiple performance measures (such as endpoint, angular and interpolation errors).

Also, the methods I posted run quite a bit faster than these deep networks (again, refer to the speed results on the Middlebury benchmark); however, I am not advocating that any reasonably accurate optical flow implementation (fast block-matching methods excepted) should be expected to run in real time on any type of CPU.
For instance, four AMD Fury Xs have a combined theoretical compute capability of 34.4 TFLOPS, while the fastest multi-core workstation CPU has a maximum computational throughput around 100~200 times lower (while also costing quite a bit more).
Furthermore, the matrix inversions and 2D convolutions that are required to implement these gradient-based methods map quite well to general GPU resources, allowing optical flow routines to make very good use of most of the available GPU resources (as long as there is at least one low latency CPU thread available to feed them).

One caveat, however, is that almost all of these methods have been published quite recently, with either no implementing program available or (if it is available) one that is normally not very optimized for speed or for hiding artifacts in real-time video.

Hmm, I know what you mean Nintendo Maniac, but unfortunately I do not think that more developer attention would be able to massively improve the performance of APUs, as there are a few fundamental problems with the promised performance of these HSA systems:

1) Consumer interest seems to be driving a new race to the bottom, where companies focus on systematically developing
    less and less powerful (euphemistically referred to as "efficient") processors every couple of generations.
   
    A lot of people seem to believe that it is a lack of competition, or even insurmountable technological problems, that
    confines us to single-digit percentage point improvements in real-world CPU performance (when both are overclocked
    as far as stability allows, the difference between a Sandy-Bridge based CPU and all of the following architectures,
    including the latest Broadwell chips, is practically nonexistent), together with a greatly slowed rate of GPU performance
    increases as well.

    However, the truth is that the IC fabrication plants are optimized to produce "low power" transistors. Even their "high-
    power" processes were engineered with 10-30W processors in mind. With almost every reviewer also shouting the
    praises of power being more important than performance, no company is going to be able to convince its directors to
    take the risk of investing billions of dollars into developing a true high-performance manufacturing node.
   
    To illustrate the crux of the matter, please have a look at the following graph (unfortunately it is already six years out
    of date, but the general trends are still valid; also, please excuse the size of the image, none of the BBCode resize tags
    seems to work...):
    [Graph from "The Free Lunch is Over": Intel CPU trends, where transistor counts keep climbing while clock speed, power and instructions-per-clock flatten out after roughly 2004]
    This image was taken from The Free Lunch is Over, where the basics of the problem are explained in much more detail.
   
    Alright, so what does all of this have to do with HSA development? Well, forcing the GPU to occupy the same space and
    power delivery circuitry as the CPU only serves to exacerbate the problem. This arrangement might not have been so
    bad, if we already had enough performance to process high-resolution and high-framerate video. It might even have
    brought some performance advantages (due to a, theoretically, much faster interconnect fabric).
    As it stands, however, the PCI-Express bus does not really bottleneck the performance of a well-written algorithm (you
    can prepare each frame on the CPU and only send it once over the bus, having the GPU then display it as soon as it
    has finished applying the required calculations; as an added bonus, only the source frames need to be
    transferred, since all interpolated frames can simply be drawn directly by the GPU).

    It seems like the only real "workaround" is to make use of multiple "add-in" cards to bring the system's power up to
    something reasonable.

2) Even with the extra theoretical performance that an integrated GPU would bring, almost all "real-time" programs rely
    on a single logical thread to control the program's flow and to dispatch, collect and organize the flow of data and
    commands. Of special importance, is this "master" process' ability to quickly (with the lowest possible latency) resolve
    race conditions and deadlocks that arise due to the asynchronous nature of multi-threaded processing (see also
    Amdahl's Law).
    Obviously, a given processor's instruction latency is directly dependent on its instruction throughput per clock cycle
    (its IPC) times its clock rate, which, as can be seen in the above image, have both basically stagnated together with
    the processor's power budget.

These are the main reasons why I am so excited about the implementation of new SIMD instructions, as they are able to massively increase (double in the case of AVX2) the amount of data that can be operated on in parallel by a single thread, without either the clock speed or the IPC having to be increased.
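A back-of-the-envelope illustration of that doubling, simply counting 16-bit lanes per register:

```python
# Counting 16-bit lanes per register width -- the whole argument in four lines
for isa, bits in [("MMX", 64), ("SSE2", 128), ("AVX2", 256)]:
    print(f"{isa}: {bits // 16} x 16-bit values per instruction")
# SSE2 -> AVX2 doubles the register width, hence ~2x the data per instruction
```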

I do apologize for the length of this post (and for my English), but some of the issues involved are somewhat complicated.
Hopefully I (or rather the linked articles big_smile ) have adequately explained how "making increased use of the increased parallel processing power of HSA" by breaking up traditionally serial logic pathways and distributing them over the integrated GPU, can actually slow down the program as a whole.

As for your comment about current hardware already being able to "max out" SVP for 1080P videos; I totally agree.
(As an interesting aside, for some types of content, a larger "motion vectors grid" setting results in fewer artifacts and better detail preservation, so "maxed out" settings do not always have to be the most computationally intensive either smile ).

Anyway, when MVTools was written, block-matching was the only remotely practical way to interpolate motion in real time on the hardware of the day. While this is very fast and works great for 2D planar motion (such as the sideways panning of a camera) it does not take into account factors such as depth or occlusion (such as when a moving object or character is partially obstructed by a channel logo or a picket fence) or even of the connectedness of solid lines (a straight edge should remain straight, even when in motion).
And while such 2D planar motion does make up a very large share of the most obnoxious motion artifacts of low-framerate video, I do hope that we will one day have some form of motion interpolation software that is based upon continuous partial derivatives and accurate phase correlation; if not in this decade then hopefully at least during the early 2020s, because I am sure we at least have enough computational power to pull it off.

Hey, thanks for all the replies guys, I really do appreciate it.

I also agree with the idea that optical flow algorithms should be able to scale pretty well, since it is an "embarrassingly parallel" problem.

Also, with CPU performance improvements stagnating since pretty much the start of the decade, specific optimizations like AVX2 would allow for almost double the performance compared to the currently most used SSE2 and MMX assembly optimizations (AVX2 doubles the amount of integers that can simultaneously be operated on).

But the most interesting improvement (to me at least) is that these new algorithms can actually segment an image into foreground and background objects, track those objects' movements, and re-draw each intact object according to the best motion vector extracted from the continuous motion field (thereby reducing "wavy" or "shimmering" artifacts) big_smile .

Hi Chainik (and the rest of the SVP Dev. Team)

I was wondering if you would consider adding a "High-Quality" interpolation algorithm to SVP4,
using the improvements introduced in MDP-Flow2
or the more recent PMMST ("High accuracy correspondence field estimation via MST based patch matching" by Zang et al.)
as a separate stretch goal or fundraiser (I would of course be willing to donate a few $100).

If there is enough interest you could perhaps include it as a paid feature (for around $250 maybe?) of SVP4 for people with relatively high-end PCs (looking over the algorithms, I suspect around 64GB of RAM and 20~30TFlops of GPU throughput would be required, which should not be too much of a problem).
I believe (or hope, rather) that I would not be the only one who finds the massively improved prediction quality (refer to the performance on the Middlebury Benchmark) worth paying for. smile