Software and Hardware Implementations of Stereo Matching

1Li Zhou, 2Tao Sun, 3Yuanzhi Zhan and 3Jia Wang
1School of Information Science and Engineering, Shandong University, P. R. China
2Shandong Provincial Key Laboratory of Network-based Intelligent Computing, University of Jinan, P. R. China
3School of Information Science and Engineering, Shandong University, P. R. China
1e-mail: [email protected], 2e-mail: [email protected]

Abstract

Stereo matching is one of the key technologies in stereo vision systems due to its ultra-high data bandwidth requirement, heavy memory accessing and algorithmic complexity. To speed up stereo matching, various algorithms have been implemented with different software and hardware processing methods. This paper presents a survey of the research status of stereo matching software and hardware implementations, based on an analysis of local and global algorithms. For the different processing platforms, including CPU, DSP, GPU, FPGA and ASIC, analyses are made of realization performance (represented by frame rate), efficiency (represented by MDES) and processing quality (represented by error rate). Among these, the GPU, FPGA and ASIC implementations are suitable for real-time embedded stereo matching applications, because they combine low power consumption and low cost with high performance. Finally, further stereo matching optimization technologies are pointed out, covering both algorithm and parallelism optimization for data bandwidth reduction and memory storage strategy.

Keywords: Stereo Matching, GPU, FPGA, ASIC

1. Introduction

The common goal of stereo vision systems is to model three-dimensional (3D) space and to render 3D objects using depth information, which is the most important element of such systems [1]. Stereo matching is one of the most active research topics concerning depth information processing. It is an important stereo vision technique that extracts depth or disparity information from stereo images obtained from slightly different viewpoints, by calculating every pixel's depth from the stereoscopic image pair. It is widely used in a range of stereo applications.

2.2. GC Method

A graph G = (V, E) is given by a set of vertices V and a set of edges E (V is sometimes called nodes, and E links). V usually corresponds to pixels or features extracted from an image, while E encodes spatial relationships.
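The grid-graph construction just described can be made concrete with a short sketch. The function below is an illustrative outline (its name and layout are assumptions, not code from any cited work), building a 4-connected grid graph whose vertices are pixel coordinates and whose edges link horizontal and vertical neighbors:

```python
# Sketch of a 4-connected grid graph G = (V, E) over image pixels,
# as used by graph-based stereo formulations. Illustrative only.
def build_grid_graph(width, height):
    # V: one vertex per pixel coordinate (x, y)
    vertices = [(x, y) for y in range(height) for x in range(width)]
    # E: spatial relationships between 4-connected neighbors
    edges = []
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                edges.append(((x, y), (x + 1, y)))  # horizontal link
            if y + 1 < height:
                edges.append(((x, y), (x, y + 1)))  # vertical link
    return vertices, edges
```

For a W×H image this yields W·H vertices and (W−1)·H + W·(H−1) edges; a GC-based matcher would then attach a data term to every vertex and a smoothness term to every edge before running min-cut.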
GC also has heavy computational complexity and memory requirements. It takes O(L³) calculation iterations, which grows quickly as the number of labels increases. That is why, even with state-of-the-art numerical optimizers, GC is hard to run at acceptable real-time rates. The GC method yields competitive results and has proved able to handle occlusion reasoning well, but its min-cut technique minimizes a sub-modular energy and is prone to partial-labeling problems. Some GC-based methods are summarized in Table 2.

Table 2. GC Algorithm Overview
  Traditional GC [41] (1998): Finds a minimum multi-way cut on a graph and gives a greedy method for computing it
  Local-minimum GC [42] (2001): Finds a local minimum with respect to expansion moves and swap moves
  Plane-based GC [43] (2004): Represents scene structure as a set of planes in the disparity space
  Smoothness GC [44] (2007): Develops optimal expansion and swap algorithms for truncated convex priors, with only two labels per pixel
  ST-mincut GC [45] (2007): Introduces an st-mincut approach to speed up inference with slightly different energy terms
  Template-based GC [46] (2011): Derives non-overlapping templates from the reference image to represent the current scene with assigned disparity values

2.3. DP Method

DP decomposes a problem into a set of sub-problems, then solves them efficiently and recursively. It lies in the middle of the spectrum, with reasonable matching performance at the expense of relatively large storage memory. Algorithm complexity is O(K×D²), where K is the number of pixels per scan-line and D is the disparity range. The major problem of DP is that inter-scanline consistency cannot be well enforced, leading to the well-known "streaking" artifacts. Table 3 gives an overview of various DP-based stereo matching methods.

Table 3.
DP Algorithm Overview
  Traditional DP [47] (1985): Employs inter-scanline search for possible connected edges and intra-scanline search for edge-delimited intervals
  Tree-based DP [48] (2005): Proposes a minimum spanning tree on the adjacency graph of an over-segmented image instead of individual scan-lines
  Two-pass DP [49] (2005): Employs a two-pass DP combined with a generalized ground control scheme, optimizing along and across the scan-lines
  Reliability DP (RDP) [50] (2005): Introduces a reliability DP based on the cost difference between the best alternate path and the path under use
  ORDP [51] (2005): Orthogonal Reliability DP generates semi-dense disparity maps using only two DP passes, based on [50]
  Adaptive DP [52] (2006): Introduces an adaptive aggregation step in the vertical direction, based on [50]
  Tree-based DP (TDP) [53] (2006): Proposes a tree-structured DP in which one pixel's disparity estimate depends on other tree pixels
  Linear Model DP [54] (2007): Applies a truncated linear model to both horizontal and vertical line dependence functions
  TDP [55] (2009): Employs geodesic distance transformation for multiple tree construction according to image geodesic distance
  Combined DP [56] (2009): Combines vertical aggregation and a DP scheme to produce disparity maps
  Multi-resolution DP [57] (2010): Uses inter-scanline consistency and scene constraints directly in disparity calculation
  Rank Transform DP [1] (2011): Uses a rank-transform-based matching function with adaptive interaction among neighboring disparities
  Hybrid DP [58] (2012): Combines cross-based adaptive window aggregation and basic dynamic programming

2.4. Window Based Matching

Window based stereo matching belongs to the local matching category. It aggregates matching cost over a given support window.
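This aggregation step can be sketched minimally as follows, assuming grayscale images stored as lists of rows, a simple SAD cost and winner-takes-all selection (all names, parameters and the toy image format are illustrative assumptions, not code from the cited methods):

```python
# Minimal fixed-window SAD stereo sketch. Images are lists of grayscale
# rows; names, parameters and the toy format are illustrative only.
def sad_disparity(left, right, max_disp, radius=1):
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            best_d, best_cost = 0, float("inf")
            # Only disparities that keep the window inside the right image
            for d in range(min(max_disp, x - radius) + 1):
                # Aggregate absolute differences over the support window,
                # shifting the right image by the candidate disparity d
                cost = sum(
                    abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
                    for dy in range(-radius, radius + 1)
                    for dx in range(-radius, radius + 1)
                )
                if cost < best_cost:  # winner-takes-all selection
                    best_cost, best_d = cost, d
            disp[y][x] = best_d
    return disp
```

Every pixel re-sums the full window for every candidate disparity; this redundancy is exactly what incremental aggregation schemes and the hardware implementations surveyed later try to eliminate.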
The local window should be large enough to include sufficient intensity variation for matching, yet small enough to avoid disparity variation inside the window. A properly designed cost function and a well-chosen window type are the fundamentals of window based stereo matching. Fixed windows, rectangular windows [21], multiple windows [22], adaptive weight (AW) [23], epipolar-geometry-based windows [24] and adaptive shape windows [25] have all been proposed. Cost function designs such as SAD, rank [26] and census [27] transforms are all valid for cost calculation. The rank and census transforms are two nonparametric transforms that depend on the relative order of pixel values rather than the pixel values themselves. The difference between them is that the rank method counts the number of pixels in a window that are less than the center pixel, while census maps a pixel window to a bit string. Both have an algorithmic structure suitable for hardware implementation and are invariant to certain types of image distortion and noise. Combined with other optimization techniques, window based matching algorithms can reach reasonable quality and performance. [23] computes pixel-wise adaptive support weights based on proximity and color distances to the center pixel. [28] uses spatio-temporal correlation and temporal variation of the disparity field. [29] limits the search range around the base line for fast search. [30] replaces disparity estimation with planes in textureless regions. [31] introduces a partial-sum method based on AW to reduce the pixel information in a large window.

2.5. Affine Transformation Method

The Scale Invariant Feature Transform (SIFT) [59] can extract distinctive features that are invariant to image scale, rotation, 3D viewpoint, noise and illumination change, and can match dense pixel-wise SIFT features between two images while preserving spatial discontinuities [60].
Affine-SIFT (ASIFT) [61] can identify features that undergo very large affine distortions. However, the huge amount of computation required by multiple cascaded transformations makes it difficult for SIFT to achieve real-time performance. [62] presents an overview of SIFT approaches based on general-purpose multi-core processors, customized multi-core processors and FPGA implementations. A Phase Singularity (PS) is a point where a complex signal equals zero. In stereo matching, PS is estimated by convolving images with complex filters. Compared with SIFT, PS-based approaches are more robust against variations in luminance and imbalance between the stereo image sensors, at the cost of higher computing resource requirements [63, 64]. Combining PS with SIFT can yield higher repeatability rates.

2.6. Other Intelligence Methods

Neural methods can also reach reasonable stereo matching quality [15]. Their most distinct characteristic is that they do not require matching between left and right elements. Instead, binocular stimuli with a specific disparity are matched with binocular neurons in the form of neuronal responses. Different neurons with different preferred weight patterns can represent the spatial left and right receptive fields; thus, the response of a neuron indicates the matching degree of the two receptive fields. [13] first presents a neurotrophic and spatiotemporal regression model of the laminar cortex architecture for stereo that can perform disparity detection competitively with sub-pixel precision. In [65], a multi-layer in-place learning network was used to detect binocular disparities. Fuzzy set theory can also be applied to stereo matching. [66] proposes a threshold-based segmentation to build interval-valued fuzzy sets. [67] adapts a biologically inspired Disparity Energy Model to separate stereo populations, trains these populations, and extracts disparity values in entire scenes.

3.
Software and Hardware Processing of Stereo Matching Methods

In the past decades, as the stereo consumer application market has flourished, researchers have devoted more effort to real-time software and hardware system design for stereo matching. Although High Performance Computing (HPC) can provide considerable computational power, stereo matching still faces many implementation challenges:

- Ultra-high computational complexity: searching all pixel candidates within a search space needs tens of iterations to find the best matching point, and every candidate's matching cost is computed with matrix multiplications or additions. Sub-pixel precision or a larger dynamic disparity range makes matters even worse.
- Ultra-high data bandwidth and on-chip SRAM size requirements: these are caused by the massive temporary data exchange the algorithms require, and memory is always the bottleneck of the whole system. Optimized data reuse, parallelization, or hierarchical schemes should be investigated for memory-related issues.
- Real-time processing and low power consumption: these require advanced accelerator schemes that combine low power consumption with high processing performance. The real-time and power issues will persist for a long time, because processing requirements grow faster than processing capability.
- Irregular algorithm parameter selection and additional pre-/post-processing steps: these must be carefully considered to resolve occlusion, inconsistency and other irregular issues in stereo matching algorithms.

To resolve the above requirements and bottlenecks, efforts are needed in three directions: algorithm optimization, software acceleration and hardware acceleration. Software acceleration evolves around parallel optimization on CPU, DSP and GPU; hardware accelerators are based on FPGA or ASIC.

3.1.
CPU Implementation

The CPU offers the highest flexibility for stereo matching algorithms, but only limited acceleration for real-time calculation of dense disparity maps, because it lacks dedicated acceleration units. Early work [50] achieved only non-video-rate performance due to limited computing power. [68] first implements software processing with low-bit-depth motion estimation algorithms for outdoor robot navigation. [69] proposes a real-time stereo depth extraction method for an intelligent home assistant robot. [70] achieves approximately 17 fps on 640×480 images with a disparity range of 25 on a 2 GHz dual-core processor.

3.2. DSP Implementation

The DSP has better signal processing capability than the CPU thanks to better data processing architectures, lower cost and lower power consumption. Although a DSP can reach reasonable stereo matching performance, it has inherent disadvantages, such as data word alignment and bandwidth throughput issues. Consequently, high-quality algorithms are seldom realized on DSP systems, which are mostly limited to window based algorithms. With powerful multimedia accelerators, high system clock frequencies, optimized cache usage and interconnections between cores, a multi-core processor is an effective way to increase stereo matching performance [71]. However, simply increasing the number of processing elements comes at the cost of higher power consumption, and there is no linear relationship between the number of processor cores and processing performance. As a result, the GPU architecture appeared.

3.3. GPU Implementation

A GPU integrates hundreds of powerful stream Processing Elements (PEs) working simultaneously, emphasizing coherent memory access, memory locality and data-parallel efficiency, with powerful floating-point arithmetic intensity and high memory bandwidth.
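As an illustration of this data-parallel structure, the sketch below treats each scanline as an independent work item and maps it onto a thread pool as a stand-in for GPU processing elements; the single-pixel absolute-difference cost and all names are simplifying assumptions, not a real GPU kernel:

```python
# Sketch of the per-pixel data parallelism GPUs exploit: each scanline
# is an independent work item, mapped here onto a thread pool as a
# stand-in for GPU processing elements. Illustrative only.
from concurrent.futures import ThreadPoolExecutor

def match_row(args):
    left_row, right_row, max_disp = args
    out = []
    for x in range(len(left_row)):
        best_d, best_cost = 0, float("inf")
        for d in range(min(max_disp, x) + 1):
            # 1x1 absolute-difference cost, the simplest possible metric
            cost = abs(left_row[x] - right_row[x - d])
            if cost < best_cost:
                best_cost, best_d = cost, d
        out.append(best_d)
    return out

def parallel_disparity(left, right, max_disp, workers=4):
    tasks = [(l, r, max_disp) for l, r in zip(left, right)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Rows are independent, so they can be processed in any order
        return list(pool.map(match_row, tasks))
```

On an actual GPU the same mapping is expressed as one thread per pixel (or per pixel-disparity pair) inside a GPGPU kernel, with cost volumes kept in fast on-chip memory.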
For example, the nVidia Tesla S1070 contains four GT200 processors; each has 240 PEs and provides about one teraflop (TF) of single-precision throughput over 100 GB/s of memory bandwidth. By comparison, a 3.2 GHz Intel Core 2 Extreme operates at only roughly 51.2 gigaflops per second (GF/s). Other GPU processing systems, such as the AMD FireGL, Qualcomm Snapdragon and ST-Ericsson U8500, are also widely used in PCs and embedded systems. Software programming models such as CUDA, OpenGL and DirectX have been developed by NVidia/AMD/Intel to assist General-Purpose Computation on GPU (GPGPU) programming for a broader community of application developers. Most stereo matching tasks perform the same computation on many pixels in parallel, so GPU stereo matching implementations have drawn much attention and obtained desirable speedups, benefiting from the GPU's hundreds of PEs and high-level software development platforms. It is crucial to design application-specific GPU stream kernels and exploit the inherent parallelism of algorithms to adapt