Topic: New High-Quality Optical Flow Algorithms for SVP4?

Hi Chainik (and the rest of the SVP Dev. Team)

I was wondering if you would consider adding a "High-Quality" interpolation algorithm to SVP4,
using the improvements introduced in MDP-Flow2
or the more recent PMMST ("High accuracy correspondence field estimation via MST based patch matching" by Zang et al.)
as a separate stretch goal or fundraiser (I would of course be willing to donate a few $100).

If there is enough interest, you could perhaps include it as a paid feature (for around $250, maybe?) of SVP4 for people with relatively high-end PCs. Looking over the algorithms, I suspect around 64GB of RAM and 20-30 TFLOPS of GPU throughput would be required, which should not be too much of a problem.
I believe (or hope, rather) that I would not be the only one who finds the massively improved prediction quality (see the results on the Middlebury benchmark) worth paying for. smile

Re: New High-Quality Optical Flow Algorithms for SVP4?

Bump for good idea. I would like to hear dev opinion on this.

Re: New High-Quality Optical Flow Algorithms for SVP4?

xenonite
Interesting. Thanks.

I suspect around 64GB of RAM

hmm... sounds promising big_smile

Re: New High-Quality Optical Flow Algorithms for SVP4?

Then you could convert videos offline, where you don't care about performance: just let the computer compute all night and then enjoy your video.

5 (edited by Nintendo Maniac 64 16-07-2015 19:49:39)

Re: New High-Quality Optical Flow Algorithms for SVP4?

It's also good to have higher-quality algorithms and such for better use of future mainstream hardware*; I mean, SVP is one of the things that can actually take advantage of "moar cores" and the like.


*it's arguable that modern high-end hardware already maxes out SVP for 1080p video (which is likely to be standard for a while), and today's high-end performance level is tomorrow's mainstream performance level.

Re: New High-Quality Optical Flow Algorithms for SVP4?

Hey, thanks for all the replies guys, I really do appreciate it.

I also agree with the idea that optical flow algorithms should be able to scale pretty well, since it is an "embarrassingly parallel" problem.

Also, with CPU performance improvements stagnating since pretty much the start of the decade, specific optimizations like AVX2 would allow for almost double the performance compared to the currently most-used SSE2 and MMX assembly optimizations (AVX2 doubles the number of integers that can be operated on simultaneously).
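The arithmetic behind that doubling can be sketched in a few lines (illustrative Python, obviously not actual SVP code; the function name is my own):

```python
# Why AVX2 can roughly double integer throughput per instruction
# compared to SSE2: the vector registers are twice as wide.
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements processed by one SIMD instruction."""
    return register_bits // element_bits

sse2_lanes = lanes(128, 32)   # SSE2: 128-bit XMM registers -> 4 x int32
avx2_lanes = lanes(256, 32)   # AVX2: 256-bit YMM registers -> 8 x int32

print(sse2_lanes, avx2_lanes, avx2_lanes / sse2_lanes)  # 4 8 2.0
```

The same ratio holds for 16-bit pixel math: 8 lanes under SSE2 become 16 under AVX2, all without touching clock speed or IPC.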

But the most interesting improvement (to me, at least) is that these new algorithms can actually segment an image into foreground and background objects, track those objects' movements, and then re-draw each object intact according to the best motion vector extracted from the continuous motion field (thereby reducing "wavey" or "shimmering" artifacts) big_smile .
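A toy sketch of that idea (illustrative Python; the function, the label map, and the "single vector per object" simplification are my own invention, not any published pipeline):

```python
# Instead of moving each block independently (which tears objects apart
# and causes "wavey" artifacts), segment out a foreground object and move
# ALL of its pixels by one coherent motion vector, keeping the object intact.
def move_object(frame, labels, object_id, dx, dy, background=0):
    """Redraw `frame` with every pixel labelled `object_id` shifted by (dx, dy).
    Disocclusion holes left behind are filled with `background` (a simplification)."""
    h, w = len(frame), len(frame[0])
    out = [[background] * w for _ in range(h)]
    # keep background pixels where they are
    for y in range(h):
        for x in range(w):
            if labels[y][x] != object_id:
                out[y][x] = frame[y][x]
    # re-draw the whole object with a single, coherent vector
    for y in range(h):
        for x in range(w):
            if labels[y][x] == object_id and 0 <= y + dy < h and 0 <= x + dx < w:
                out[y + dy][x + dx] = frame[y][x]
    return out

# a 1x2 "object" (the 9s) slides one pixel to the right, in one piece
frame  = [[0, 9, 0],
          [0, 9, 0],
          [0, 0, 0]]
labels = [[0, 1, 0],
          [0, 1, 0],
          [0, 0, 0]]
print(move_object(frame, labels, 1, 1, 0))  # [[0, 0, 9], [0, 0, 9], [0, 0, 0]]
```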

7 (edited by Nintendo Maniac 64 16-07-2015 20:52:21)

Re: New High-Quality Optical Flow Algorithms for SVP4?

Personally I'm more interested in the likes of HSA seeing how SVP is already capable of using GPU-acceleration.  However, it looks like HSA as a whole will receive very minimal interest from software developers until Zen APUs are a thing.

Re: New High-Quality Optical Flow Algorithms for SVP4?

Hmm, I know what you mean, Nintendo Maniac, but unfortunately I do not think that more developer attention would be able to massively improve the performance of APUs, as there are a few fundamental problems with the promised performance of these HSA systems:

1) Consumer interest seems to be driving a new race to the bottom, where companies focus on systematically developing
    less and less powerful (euphemistically referred to as "efficient") processors every couple of generations.
   
    A lot of people seem to believe that it is a lack of competition, or even insurmountable technological problems, that
    confines us to single-digit percentage-point improvements in real-world CPU performance (when both are overclocked
    as far as stability allows, the difference between a Sandy Bridge-based CPU and all of the following architectures,
    including the latest Broadwell chips, is practically nonexistent), together with a greatly slowed rate of GPU
    performance increases as well.

    However, the truth is that the IC fabrication plants are optimized to produce "low-power" transistors. Even their
    "high-power" processes were engineered with 10-30W processors in mind. With almost every reviewer also shouting the
    praises of power being more important than performance, no company is going to be able to convince its directors to
    take the risk of investing billions of dollars into developing a true high-performance manufacturing node.
   
    To illustrate the crux of the matter, please have a look at the following graph (unfortunately it is already 6 years
    out of date, but the general trends are still valid; also, please excuse the size of the image, as none of the BBCode
    resize tags seem to work...):
    [graph: CPU scaling trends from The Free Lunch is Over]
    This image was taken from The Free Lunch is Over, where the basics of the problem are explained in much more detail.
   
    Alright, so what does all of this have to do with HSA development? Well, forcing the GPU to occupy the same space and
    power delivery circuitry as the CPU only serves to exacerbate the problem. This arrangement might not have been so
    bad if we already had enough performance to process high-resolution and high-framerate video. It might even have
    brought some performance advantages (due to a, theoretically, much faster interconnect fabric).
    As it stands, however, the PCI-Express bus does not really bottleneck the performance of a well-written algorithm (you
    can prepare each frame on the CPU and only send it once over the bus, having the GPU then display it as soon as it
    has finished applying the required calculations). As an added bonus, only the source frames would need to be
    transferred; all interpolated frames can simply be drawn directly by the GPU.
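A quick back-of-the-envelope check of that bandwidth argument (the frame format and bus figures are illustrative assumptions on my part, not measurements):

```python
# If only the SOURCE frames cross the PCI-Express bus, and all interpolated
# frames are synthesized on the GPU, the required bandwidth is tiny.
def gb_per_s(width, height, bytes_per_pixel, fps):
    """Uncompressed video bandwidth in GB/s."""
    return width * height * bytes_per_pixel * fps / 1e9

source = gb_per_s(1920, 1080, 1.5, 24)  # 1080p24 source, 8-bit YUV 4:2:0
pcie3_x16 = 15.75                       # approx. usable PCIe 3.0 x16 bandwidth, GB/s

print(source)                # ~0.075 GB/s
print(source / pcie3_x16)    # well under 1% of the bus
```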

    It seems like the only real "workaround" is to make use of multiple "add-in" cards to bring the system's power up to
    something reasonable.

2) Even with the extra theoretical performance that an integrated GPU would bring, almost all "real-time" programs rely
    on a single logical thread to control the program's flow and to dispatch, collect and organize the flow of data and
    commands. Of special importance is this "master" process's ability to quickly (with the lowest possible latency) resolve
    race conditions and deadlocks that arise due to the asynchronous nature of multi-threaded processing (see also
    Amdahl's Law).
    Obviously, a given processor's instruction latency is directly dependent on its instruction throughput per clock cycle
    (its IPC) times its clock rate which, as can be seen in the above image, have both basically stagnated together with
    the processor's power budget.
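The Amdahl's Law limit mentioned above can be made concrete with a small sketch (illustrative Python):

```python
# Amdahl's Law: overall speedup is capped by the serial ("master thread")
# fraction of the work, no matter how many parallel workers are added.
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

# Even with a 95%-parallel workload, thousands of GPU cores cannot push
# the overall speedup past 1 / 0.05 = 20x.
print(amdahl_speedup(0.95, 8))      # ~5.9x
print(amdahl_speedup(0.95, 4096))   # ~19.9x: capped by the serial 5%
```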

These are the main reasons why I am so excited about the implementation of new SIMD instructions, as they are able to massively increase (double in the case of AVX2) the amount of data that can be operated on in parallel by a single thread, without either the clock speed or the IPC having to be increased.

I do apologize for the length of this post (and for my English), but some of the issues involved are somewhat complicated.
Hopefully I (or rather the linked articles big_smile ) have adequately explained how "making increased use of the increased parallel processing power of HSA" by breaking up traditionally serial logic pathways and distributing them over the integrated GPU can actually slow down the program as a whole.

As for your comment about current hardware already being able to "max out" SVP for 1080p videos; I totally agree.
(As an interesting aside, for some types of content, a larger "motion vectors grid" setting results in fewer artifacts and better detail preservation. So, "maxed out" settings do not always have to be the most computationally intensive, either smile ).

Anyway, when MVTools was written, block matching was the only remotely practical way to interpolate motion in real time on the hardware of the day. While this is very fast and works great for 2D planar motion (such as the sideways panning of a camera), it does not take into account factors such as depth or occlusion (such as when a moving object or character is partially obstructed by a channel logo or a picket fence), or even the connectedness of solid lines (a straight edge should remain straight, even when in motion).
And while such 2D planar motion does make up a very large share of the most obnoxious motion artifacts of low-framerate video, I do hope that we will one day have some form of motion interpolation software based upon continuous partial derivatives and accurate phase correlation; if not in this decade, then hopefully at least during the early 2020s, because I am sure we at least have enough computational power to pull it off.
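For reference, the block-matching idea described above can be sketched in a few lines (a deliberately naive exhaustive-search toy in Python, nothing like MVTools' optimized implementation):

```python
# Block matching: for a block in the previous frame, search a window in the
# current frame for the position with the lowest sum of absolute differences
# (SAD); the winning offset is that block's motion vector.
def block_at(frame, x, y, size):
    return [frame[y + j][x + i] for j in range(size) for i in range(size)]

def sad(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def best_vector(prev, cur, bx, by, size, search):
    """Exhaustive search: motion vector for the block at (bx, by) of `prev`."""
    ref = block_at(prev, bx, by, size)
    h, w = len(cur), len(cur[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if 0 <= x <= w - size and 0 <= y <= h - size:
                cost = sad(ref, block_at(cur, x, y, size))
                if cost < best_cost:
                    best_cost, best = cost, (dx, dy)
    return best

# a 2x2 bright block moves one pixel to the right between frames
prev = [[9, 9, 0, 0],
        [9, 9, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
cur  = [[0, 9, 9, 0],
        [0, 9, 9, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(best_vector(prev, cur, 0, 0, 2, 2))  # (1, 0)
```

Note how the search knows nothing about depth, occlusion, or which pixels belong to the same object; each block is matched in isolation, which is exactly where the artifacts discussed above come from.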

9 (edited by Nintendo Maniac 64 17-07-2015 04:22:05)

Re: New High-Quality Optical Flow Algorithms for SVP4?

Regarding the performance of APUs, there's a reason I mentioned Zen APUs. I believe the inadequate traditional x86 performance of current AMD APUs is largely the reason why there is minimal development regarding HSA as a whole: if the traditional x86 performance isn't there, then performance-focused developers won't even own the APUs in question and therefore will not develop for them. And even if they did own one, they could very well see it as a "why bother?", on the logic that an APU owner clearly isn't interested in CPU performance, because otherwise they would have purchased a non-APU product in the first place (though technically even Haswell is an APU).


Basically, it's very difficult to convince developers to develop for something new when the competition is considerably better at the traditional method, to the point that your competitor is the general "go-to" product, resulting in a lack of developers even owning your product.


Oh, and fun fact: Haswell is quite a bit faster than Sandy Bridge at all-things emulation for some unknown reason.

Re: New High-Quality Optical Flow Algorithms for SVP4?

Nintendo Maniac 64 wrote:

*it's arguable that modern high-end hardware already maxes out SVP for 1080p video (which is likely to be standard for a while), and today's high-end performance level is tomorrow's mainstream performance level.

How is today's hardware maxing out SVP?

Luckily, for HSA you don't have to rely on APUs alone to drive development, as Nvidia will soon be joining AMD in having GPU/CPU UMA with its discrete GPUs, and the currently larger dGPU market will drive HSA improvements that will "downscale" to the currently smaller APU market.

Additionally, Apple is embracing OpenCL and HSA and they're a player able to effect change in the industry. 

FWIW.

Re: New High-Quality Optical Flow Algorithms for SVP4?

xenonite
I do hope that we will one day have some form of motion interpolation software that is based upon continuous partial derivatives and accurate phase correlation; if not in this decade then hopefully at least during the early 2020s, because I am sure we at least have enough computational power to pull it off.

http://www.technologyreview.com/view/53 … ld-images/

https://www.youtube.com/watch?v=cizgVZ8 … e=youtu.be

it takes 12 minutes on a multicore workstation to produce a single newly synthesized image

smile

Re: New High-Quality Optical Flow Algorithms for SVP4?

These clips have SERIOUS artifacts.

SVP does a better job in real-time smile

13 (edited by xenonite 17-07-2015 20:15:31)

Re: New High-Quality Optical Flow Algorithms for SVP4?

Yes, I agree, but the links Chainik posted have very little to do with what I proposed.
They refer to completely different algorithms from those I linked to: specifically, they use deep networks, while I proposed using continuous partial derivatives, which (if you check the objective Middlebury benchmark) beat both block-based matching and trained neural networks on multiple performance measures (such as endpoint, angular and interpolation errors).
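For anyone curious, the two per-pixel error measures mentioned here are straightforward to compute (a sketch following the standard Middlebury/Barron definitions; `u, v` is the estimated flow vector and `ut, vt` the ground truth):

```python
import math

def endpoint_error(u, v, ut, vt):
    """Euclidean distance between estimated and true flow vectors."""
    return math.hypot(u - ut, v - vt)

def angular_error(u, v, ut, vt):
    """Angle, in degrees, between the 3D vectors (u, v, 1) and (ut, vt, 1),
    as used by the Middlebury benchmark."""
    num = u * ut + v * vt + 1.0
    den = math.sqrt(u * u + v * v + 1.0) * math.sqrt(ut * ut + vt * vt + 1.0)
    return math.degrees(math.acos(max(-1.0, min(1.0, num / den))))

print(endpoint_error(1.0, 0.0, 0.0, 0.0))  # 1.0
print(angular_error(1.0, 0.0, 1.0, 0.0))   # ~0 (identical vectors)
```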

Also, the methods I posted run quite a bit faster than these deep networks (again, refer to the speed results on the Middlebury benchmark). That said, I am not claiming that any reasonably accurate optical flow implementation (aside from fast block-matching methods) could be expected to run in real time on any kind of CPU.
For instance, four AMD Fury X cards have a combined theoretical computational capability of 34.4 TFLOPS, while the fastest multicore workstation CPU has a maximum throughput around 100-200 times lower (while also costing quite a bit more).
Furthermore, the matrix inversions and 2D convolutions required to implement these gradient-based methods map quite well onto general GPU resources, allowing optical flow routines to make very good use of the available GPU hardware (as long as there is at least one low-latency CPU thread available to feed them).
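As an illustration of why such kernels parallelize so well, here is a plain 2D convolution (valid mode, illustrative Python): every output pixel is an independent weighted sum, so on a GPU each one could be computed by its own thread.

```python
# Valid-mode 2D convolution. No output pixel depends on any other output
# pixel, which is what makes this kernel "embarrassingly parallel".
def convolve2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):            # each (y, x) could be one GPU thread
            out[y][x] = sum(image[y + j][x + i] * kernel[j][i]
                            for j in range(kh) for i in range(kw))
    return out

# 3x3 box kernel (all ones) over a simple horizontal gradient
img = [[0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3]]
k = [[1] * 3 for _ in range(3)]
print(convolve2d_valid(img, k))  # [[9, 18]]
```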

One caveat, however, is that almost all of these methods have been published quite recently, with either no implementation available at all or (when one is available) one that is usually not optimized for speed or for hiding artifacts in real-time video.