Our third piece on FPGA use cases discusses real-time video stitching. As the name implies, video stitching combines multiple overlapping videos into a single wide-field video. The result is higher-resolution output and a more immersive experience in applications such as augmented reality. Video stitching has been used in various fields such as sports broadcasting, video surveillance, street view, and entertainment.
It is a more complex extension of image stitching, which is among the oldest and most widely discussed topics in computer vision and graphics. Moving objects and camera movements add factors such as shakiness, blurring and ghosting to the equation, bringing in an additional layer of difficulty.
There are various approaches to video stitching. This image gives a high-level overview of a typical stitching flow.
Step 1: Alignment and Calibration
Focal points of the cameras are tallied and a baseline is created for depth detection. Images are analyzed to find the relationship between them. The objective is to find the overlap and calculate how each image must be transformed to align the images appropriately.
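As a minimal sketch of what "how each image must be transformed" can mean in practice, the snippet below maps pixel coordinates through a 3×3 homography and derives the overlap between two views. It assumes the transform is already known; the matrix `H_SHIFT` and the helper names are hypothetical, and a real pipeline would estimate the matrix from calibration data rather than hard-code it.

```python
def apply_homography(h, point):
    """Map an (x, y) pixel through a 3x3 homography given as nested lists."""
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return (xh / w, yh / w)


# Hypothetical transform: camera B's image sits 300 px to the right of
# camera A's image, with no rotation or scaling (a pure translation).
H_SHIFT = [[1.0, 0.0, 300.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]


def overlap_width(h, width_a, width_b):
    """Horizontal overlap in pixels between image A and transformed image B.

    Only meaningful for the simple side-by-side case assumed here.
    """
    left_b = apply_homography(h, (0, 0))[0]
    right_b = apply_homography(h, (width_b, 0))[0]
    return max(0.0, min(width_a, right_b) - max(0.0, left_b))
```

With two 400-pixel-wide images and the assumed 300-pixel shift, `overlap_width(H_SHIFT, 400, 400)` reports a 100-pixel overlap, which is the region where the later steps look for shared features.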
Step 2: Feature Extraction and Matching
Feature extraction is the next step, in which distinct features are identified in each video input. These features are then matched against those of the other video inputs to decide whether an object is moving across overlapping camera regions.
Diving deeper, feature extraction is a process where a local decision is made at every image point to see if there is a distinct feature that can be recognized. As variations often appear due to changes in illumination or camera movement, it is important to have a model that can accurately estimate features.
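To make the "local decision at every image point" concrete, here is a deliberately simplified sketch. It is not SIFT: it only flags a pixel as a feature when it is brighter than all eight neighbours by an assumed `min_contrast`. The function name and threshold are illustrative assumptions.

```python
def detect_features(image, min_contrast=10):
    """Return (row, col) of pixels that exceed all 8 neighbours by min_contrast.

    `image` is a grayscale frame given as a list of equal-length rows.
    Border pixels are skipped because they lack a full neighbourhood.
    """
    rows, cols = len(image), len(image[0])
    features = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            neighbours = [image[r + dr][c + dc]
                          for dr in (-1, 0, 1)
                          for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            # The local decision: is this point distinct from its surroundings?
            if centre - max(neighbours) >= min_contrast:
                features.append((r, c))
    return features
```

Because the test compares each pixel with its neighbours rather than with an absolute brightness, adding a constant offset to the whole frame leaves the detected points unchanged. That is a small-scale version of the robustness to illumination changes that production detectors aim for.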
Scale Invariant Feature Transform (SIFT) is a widely used technique where image data is transformed into coordinates relative to local features. It has proven to be extremely effective in object recognition applications but remains a bottleneck due to the intense computation required.
For example, a SIFT execution on a typical image of size 500×500 pixels will give rise to roughly 2000 stable features, covering keypoint detection, orientation, and descriptor labeling - all within a tight timeframe. This complex computation requirement has thus far been a roadblock, especially in projects requiring real-time stitching.
FPGAs have a role to play here because the SIFT algorithm is built on matrix operations. Early experiments have shown that an FPGA-based architecture can detect interest points in a 320 × 240 image in 11 ms, roughly 250 times faster than a software implementation.
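A quick back-of-the-envelope calculation puts those figures in real-time terms. It uses only the two numbers quoted above and covers interest point detection alone, not the full stitching flow; the variable and function names are illustrative.

```python
FPGA_MS_PER_IMAGE = 11   # interest point detection, 320 x 240 image
SPEEDUP = 250            # FPGA versus software, as reported


def images_per_second(ms_per_image):
    """Convert a per-image latency in milliseconds to throughput."""
    return 1000 / ms_per_image


# Implied software latency for the same detection step.
software_ms_per_image = FPGA_MS_PER_IMAGE * SPEEDUP

fpga_rate = images_per_second(FPGA_MS_PER_IMAGE)
software_rate = images_per_second(software_ms_per_image)
```

Under these numbers the FPGA sustains roughly 90 images per second for this step, while the implied software figure of 2750 ms per image is well under one image per second.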
Matching is the next step in the process where the extracted features are compared against each other to look for similarities.
This step is currently the slowest part of the entire flow. The idea is simple: extracted features are compared against each other within a tolerance margin, and feature points without a match inside this margin are discarded. The slowness comes from having to match hundreds of thousands of feature points.
Fortunately, FPGAs can add value here too. Working through the features is a highly iterative process and hence well suited to parallel computation. Matching algorithms implemented on FPGAs have been shown to run about 27 times faster than a PC-based implementation, thanks to the pipelined nature of the FPGA.
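The matching rule described above can be sketched as a brute-force nearest-neighbour search. This is an illustrative sketch, not a specific published matcher: it assumes descriptors are plain numeric tuples, that Euclidean distance is the similarity measure, and that `margin` is the acceptance threshold.

```python
import math


def match_features(desc_a, desc_b, margin):
    """Pair each descriptor in desc_a with its nearest one in desc_b.

    Returns (index_a, index_b) pairs; a feature whose nearest neighbour
    lies further away than `margin` is discarded.
    """
    matches = []
    for i, da in enumerate(desc_a):
        best_j, best_dist = None, math.inf
        for j, db in enumerate(desc_b):
            dist = math.dist(da, db)  # Euclidean distance
            if dist < best_dist:
                best_j, best_dist = j, dist
        if best_j is not None and best_dist <= margin:
            matches.append((i, best_j))
    return matches
```

The nested loop performs one distance computation for every pair of features, which is where the cost comes from when both inputs hold many thousands of points. Each outer iteration is independent of the others, so the work can be spread across parallel compute units.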
Step 3: Template Creation
In the next step, a stitching template is generated by stitching selected frames of video inputs. All subsequent frames are stitched according to this template, with the template being updated when matched features are identified to be moving across the overlapping regions between videos.
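The template idea can be illustrated with a toy model in which the "template" is just the column where the second camera's frame begins. Everything here is a simplifying assumption: real templates hold full warps and blending masks, and `estimate_offset` and `moving_in_overlap` are hypothetical callbacks standing in for the alignment and matching steps.

```python
class StitchTemplate:
    """Toy template: the column of frame A at which frame B takes over."""

    def __init__(self, offset):
        self.offset = offset
        self.updates = 0

    def stitch(self, frame_a, frame_b):
        # Keep frame A up to the seam, then append frame B row by row.
        return [row_a[:self.offset] + row_b
                for row_a, row_b in zip(frame_a, frame_b)]

    def update(self, new_offset):
        self.offset = new_offset
        self.updates += 1


def stitch_stream(frame_pairs, template, estimate_offset, moving_in_overlap):
    """Stitch every frame pair with the cached template.

    The template is recomputed only when the (hypothetical) detector
    reports features moving across the overlap region.
    """
    stitched = []
    for frame_a, frame_b in frame_pairs:
        if moving_in_overlap(frame_a, frame_b):
            template.update(estimate_offset(frame_a, frame_b))
        stitched.append(template.stitch(frame_a, frame_b))
    return stitched
```

The point of the design is that the expensive estimation runs only when the scene demands it, while ordinary frames reuse the cached template at the cost of a simple copy.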
With these few simple steps, you now have the core requirements for creating a stitched video.
The opportunity with SIFT and other popular feature recognition algorithms is that they all involve matrix functions, which opens the door for FPGAs to accelerate and enhance the process. Studies have shown that bringing FPGAs into the mix yields significant improvements, making real-time stitching much more attainable.
The future potential of real-time stitching is vast, enabling live object detection over a wide area, tracking on the edge for real-time analytics and even enhanced security systems. A significant number of startups in Asia are working on these topics and we are actively connecting with the ecosystem.
We have encountered companies in sports analysis and augmented reality that struggle to achieve an acceptable level of real-time stitched video quality. High-end GPUs are beginning to show that they cannot handle the load alone, and an FPGA solution will be the logical next step.