# NABI: Low power, high speed FPGA based Novel Approach for Bilateral filter

Erulappan SAKTHIVEL, Veluchamy MALATHI, Muruganantham ARUNRAJA, and Govindaraj Perumalvignesh

Department of Electrical and Electronics Engineering, The Siliconnharvest, Madurai, Tamilnadu, India

Copyright © 2016 ISSR Journals. This is an open access article distributed under the *Creative Commons Attribution License*, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**ABSTRACT:** Bilateral filters are used in a wide range of medical and industrial applications. The conventional bilateral filter architecture is limited by a restricted kernel size and a constant delay, which depends on two quantities: the width of the image and the sum of the processing element delays. When the inputs vary, the required kernel size can grow, which may degrade overall performance in terms of both image quality and FPGA-level metrics (scalability, latency, power consumption). To overcome this problem, a Low power, high speed FPGA based Novel Approach for Bilateral filter (NABI) is introduced. NABI consists of a Structure Shared Architecture (SSA), a Master Control Unit (a combination of an intensity calculator and a graph theory based traffic estimator), a kernel based clock unit and a Reconfigurable Server. These components are described at the register transfer level and implemented in VHDL. Reconfiguration takes place via the reconfigurable server, depending on the size of the kernel. The intensity calculator estimates the intensity of the image, and that intensity value is passed to the normalization block to achieve better PSNR and MSE. The proposed NABI is implemented in a Virtex-5VLX50-1 device. At the FPGA level, it achieves a 31.69% slice reduction, a 49.51% frame rate improvement, a 28.96% power reduction and a 50% latency reduction. The image quality is also assessed and compared with conventional algorithms. Overall, NABI achieves better results than the conventional work.

KEYWORDS: Bilateral Filter, image processing, real time processing, Field Programmable Gate Array (FPGA).

## 1 INTRODUCTION

In the image processing domain, bilateral filtering plays a vital role in reducing noise while preserving the basic information of an image [1]. Tomasi et al. proposed a bilateral filter for grayscale and color images, which is a combination of a photometric filter and a geometric filter. In [2], an FPGA-based bilateral filter for real-time image processing applications is introduced. Its filter architecture uses a highly parallelized pipeline structure and a kernel-based design clocked at four times the pixel clock. The major contribution of Riesgo et al. is an FPGA design architecture of the bilateral filter on the register-transfer level (RTL). In [3], a direct clocking scheme with real-time processing and effective utilization of resources is used; this design requires no external image storage register. Hannig et al. introduced an almost automatically synthesized, throughput-improved architecture of an adaptive multiresolution filter (AMRF) [4], which has been used in medical and real-time image processing on FPGAs. The experimental section of the AMRF work deals with latency, power and image quality. In [5], DSP and FPGA implementations of algorithms are discussed with the aim of reducing complexity and meeting timing constraints; a holistic functional approach is enabled by setting up a unique modeling and evaluation environment.

In [6], Rodriguez et al. discuss digital processing systems built as parallel architectures in an FPGA, which give effective performance at a lower cost than other parallel architectures. In [7], the authors introduced an FPGA-based fully synchronized design of a bilateral filter (SDBL), which uses kernel-based preprocessing to collect the input and has a highly parallelized pipeline structure. This design is described at the register transfer level. The main limitation of [7] is that the kernel size is limited to 5x5, which restricts the range of images that can be filtered effectively. Charoensak et al., in contrast, deal with a 15x15 kernel size, and their dynamic range compression is based on a multi-scale decomposition of the image [8]. To reduce the hardware complexity, Charoensak et al. require no frame buffer. Here the latency depends on two components: the sum of the processing element delays and the image width. According to [8], input-side preprocessing is not required for the bilateral filter architecture. The experimental section of [7] compares SDBL only with the Highly Parallel Architecture (HPA) of [9]. However, [9] also presents a Structure Shared Architecture (SSA), which gives superior performance to HPA. This SSA is not considered in [7]. The SSA is therefore adopted for our proposed design, together with the advantages of [7], namely the synchronized clock and the kernel based structure.

The proposed work has the following features, all of which are of low complexity. The Low power, high speed FPGA based Novel Approach for Bilateral filter (NABI) has the following modules: the SSA, a kernel based structure with a synchronized clock model, a Master Control Unit (MCU) and a reconfigurable server. In [10], massively parallel processor array architectures are employed to run the bilateral filter on an FPGA platform. The reconfiguration approach and the Graph theory based Traffic Estimation (GTE) concept are taken from Sakthivel et al. [11]: the master control unit of the proposed NABI design consists of the GTE and the reconfiguration approach, and both modules, as reported in [11], are used for the MCU construction. In [12], the advantages of various architectures for the bilateral filter on an FPGA platform are discussed. That paper analyzes multiple performance measures, such as resources, delay, power consumption and maximum frame rate, for 150x150 images.

The rest of this paper is organized as follows. Section 2 addresses the conventional work. Proposed work and its module details are discussed in section 3. The experimental results are presented in section 4. Finally, the conclusion is presented in section 5.

## 2 CONVENTIONAL WORK

The conventional bilateral filter with the HPA architecture is shown in Fig. 1. It has three main modules: a kernel based architecture with a synchronization clock module, a photometric filter and a geometric filter. This conventional work is implemented in a Xilinx Virtex-5VLX50-1 device, and its experimental results are reported by Gabiger-Rose et al. [7]. The limitation of [7] has already been discussed in the introduction. To handle various datasets as input, a reconfiguration approach is added and the conventional HPA is refined into SSA. NABI is introduced to provide better results than the conventional work in terms of both image quality assessment and FPGA-level performance.
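As a software point of reference, the combination of a geometric and a photometric filter can be sketched in a few lines of Python. This is a brute-force illustration only, not the RTL design of [7]: the function name `bilateral_filter`, the Gaussian shape of both weights, the parameters `sigma_s` and `sigma_r`, and the border clamping are assumptions made for this sketch. A `radius` of 2 corresponds to a 5x5 kernel.

```python
import math


def bilateral_filter(img, radius=2, sigma_s=1.0, sigma_r=25.0):
    """Brute-force bilateral filter on a 2-D list of grayscale values.

    Gaussian weights and border clamping are assumptions of this sketch.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            center = img[y][x]
            num = 0.0  # weighted sum of neighbour intensities
            den = 0.0  # sum of weights, used for normalization
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates at the image border.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    p = img[yy][xx]
                    # Geometric weight: depends on spatial distance only.
                    g = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
                    # Photometric weight: depends on intensity difference only.
                    r = math.exp(-((p - center) ** 2) / (2.0 * sigma_r ** 2))
                    num += g * r * p
                    den += g * r
            out[y][x] = num / den
    return out
```

Because the photometric weight collapses for large intensity differences, pixels on the far side of a strong edge contribute almost nothing to the average, which is the edge-preserving behaviour the hardware modules implement.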



Fig. 1. Gabiger-Rose et al bilateral filter architecture

## 3 PROPOSED DESIGN

Fig. 2 shows the proposed system, which has a Kernel based Architecture (KA), a structure shared architecture (SSA), a new normalization unit, a Reconfigurable Server (RS) and a Master Control Unit (MCU). The image input is collected at the input side of the KA; at the same time, the MCU and the RS collect the input information. The MCU responds first and gives the corresponding information to the RS and the SSA unit. The MCU consists of two components, the GTE and the intensity calculator. The GTE estimates the traffic information and gives the traffic id to the RS. The RS then gives reconfiguration information to the KA, and the KA activates its registers and sets its size according to this information. The benefit of the SSA is its low complexity; with it, both image quality and FPGA-level performance are achieved. The reconfigurable information part is refined from Gabiger-Rose et al. [7]. The intensity calculator available in NABI estimates the intensity, so that, according to [8], the normalization operation is performed without additional components.

#### 3.1 KERNEL BASED ARCHITECTURE (KA)

The KA is structured as a finite state representation to provide an effective, low-complexity architecture. This block is referred to as the state representation of the registers. It can load the pixels, in any form, into the register matrix in one pixel cycle. In the KA, the register block produces sorted output in groups, which are fed into the SSA component at the clock frequency. The sorting is done by multiplexing the pixels in a fixed routine. A counter generates the select signal and regulates the information of the register block; once the whole register block has been processed, the counter is enabled. The grouped pixels operate in parallel, but each group is connected to the output of the register block in a pipelined manner. The central pixel of the filter section is not connected to any group and is given to the SSA directly as mid\_pix. This highly parallelized pipeline architecture provides the pixel clock, which is vital for the FPGA design concept.
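The register matrix idea can be illustrated with a small software model. The sketch below assumes a raster-scanned pixel stream and models the line buffers plus window registers as a single shift register; the name `stream_windows`, the parameter names and the buffering scheme are hypothetical and are not taken from the paper. The grouping and sorting of pixels is not modeled, only the window and its central pixel.

```python
from collections import deque


def stream_windows(pixels, width, k=5):
    """Yield (mid_pix, window) for each fully valid k x k window.

    `pixels` is a raster-scanned stream; one pixel enters per pixel clock.
    """
    # k - 1 full image lines plus k window registers.
    depth = (k - 1) * width + k
    shift = deque(maxlen=depth)
    for n, p in enumerate(pixels):
        shift.append(p)
        x = n % width  # column of the pixel that just entered
        if len(shift) == depth and x >= k - 1:
            # Tap k consecutive registers out of each buffered line.
            window = [[shift[r * width + c] for c in range(k)] for r in range(k)]
            mid_pix = window[k // 2][k // 2]
            yield mid_pix, window
```

For a 6x6 image and `k=5`, the generator produces four windows, one for each position at which the full kernel fits inside the image.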



Fig. 2. Proposed NABI work

#### 3.2 STRUCTURE SHARED ARCHITECTURE (SSA)

To reduce the complexity of HPA, the Structure Shared Architecture (SSA) is introduced. The SSA is placed after the kernel based architecture and implements the edge preserving filter. It reduces resource usage compared with the fully parallel architecture. The architecture consists of multiplexers, a control unit, adders, a subtractor and a divider section. The SSA is constructed so as to avoid a long delay on the critical path. The function of this block is similar to the work of Vinh et al. [9].

#### 3.3 MASTER CONTROL UNIT (MCU)

The master control unit consists of two components: the GTE and the intensity calculator.

#### • Graph theory based Traffic Estimator (GTE)

The technical details of the Graph theory based Traffic Estimator (GTE) are taken from Sakthivel et al. [11].

#### • Intensity Calculator

The intensity of the input data is calculated in the MCU. The image intensity function is calculated mathematically using the Laplacian concept with symmetric boundary conditions [13].
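To make the Laplacian with symmetric boundary conditions concrete, the following sketch evaluates a discrete Laplacian on a small image. The 5-point stencil, the function name `laplacian_symmetric` and the mirror indexing are assumptions of this illustration; the paper does not specify the discretization or how the MCU uses the result.

```python
def laplacian_symmetric(img):
    """5-point discrete Laplacian with symmetric (mirrored) boundaries."""
    h, w = len(img), len(img[0])

    def mirror(i, n):
        # Symmetric extension: index -1 maps to 0 and index n maps to n - 1.
        if i < 0:
            return -i - 1
        if i >= n:
            return 2 * n - i - 1
        return i

    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            up = img[mirror(y - 1, h)][x]
            down = img[mirror(y + 1, h)][x]
            left = img[y][mirror(x - 1, w)]
            right = img[y][mirror(x + 1, w)]
            out[y][x] = up + down + left + right - 4 * img[y][x]
    return out
```

With mirrored boundaries, a constant image yields zero everywhere, including at the border, so no artificial edges are introduced by the padding.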

#### • Reconfigurable Server (RS)

The main function of the reconfigurable server is to provide reconfiguration information to the filter section and the kernel architecture. The kernel size and the filter size are reconfigured according to the input traffic and the image size.

|                                     | Charoensak et al. [8] | Vinh et al. [9]   | Dutta et al. [10] | Pal et al. [12]       | Gabiger-Rose et al. [7] | Proposed                | Proposed (new)          |
|-------------------------------------|-----------------------|-------------------|-------------------|-----------------------|-------------------------|-------------------------|-------------------------|
| Filter method                       | BF                    | HPA-BF, SSA-BF    | BF                | HPA, SSDA, HD         | BF                      | HPA                     | SSA                     |
| Kernel size                         | 15x15                 | 3x3               | 3x3               | N.R                   | 5x5                     | Variable (3x3 to 15x15) | Variable (3x3 to 15x15) |
| FPGA family                         | Xilinx Spartan 3      | Altera Cyclone II | Xilinx Virtex-II  | Xilinx Virtex-5       | Xilinx Virtex-5         | Xilinx Virtex-5         | Xilinx Virtex-5         |
| Max clock frequency                 | 72.2 MHz              | 159 MHz           | 87.65 MHz         | N.R                   | 220 MHz                 | 180 MHz                 | 245 MHz                 |
| Maximum frame rate (fps) @1024x1024 | 4.58                  | 151.63            | 83.59             | with 150x150 image    | 52.45                   | 50.12                   | 78.42                   |
| Resource: logic elements            | -                     | 567, 450          |                   |                       |                         |                         |                         |
| Logic slices<br>Multipliers         | 2150<br>n.a           | -, -<br>32, 4     | 1447<br>9         | <br>1586, 623, 740    | 1060<br>23              | 1280<br>36              | 724<br>18               |
| Maximum path delay                  | 13.8 ns               | 6.3 ns, 4.4 ns    | N.R               |                       | N.R                     | 2.8                     | 1.4                     |
| Static power                        | N.R                   | N.R               | N.R               | 1.188, 0.702, 1.188   | N.R                     | 1.212                   | 0.863                   |
| Dynamic power                       | N.R                   | N.R               | N.R               | 0.072, 0.025, 0.068   | N.R                     | 0.070                   | 0.032                   |
| Total power                         | N.R                   | N.R               | N.R               | 1.26, 0.728, 1.26     | N.R                     | 1.282                   | 0.895                   |

Table 1. Performance comparison at FPGA level

## 4 RESULT AND DISCUSSION

#### 4.1 FPGA LEVEL PERFORMANCE ASSESSMENT

Table 1 compares the FPGA-level performance of the proposed design with five different conventional architectures. The first one, the bilateral filter (BF) of [8], is implemented in a Xilinx Spartan 3; it has a maximum operating frequency of 72.2 MHz, a reported frame rate of 4.58 fps, a resource utilization of 2150 slices and a maximum path delay of 13.8 ns. To improve on this, HPA and SSA are introduced in [9]. The limitation of that work is that its kernel is smaller than that of [8]; with a smaller kernel size, the complexity is automatically reduced as well. That work is implemented in an Altera FPGA. The same kernel size is implemented in [12] on a Xilinx Virtex platform, and the observed results are those reported by Pal et al.

NABI is introduced to deliver better performance than the above architectures and, unlike Gabiger-Rose et al. [7], to bring reconfiguration into a single architecture. The proposed architecture is developed with both HPA and SSA, together with the reconfigurable components and the intensity calculator, and is implemented in a Xilinx Virtex-5VLX50-1. It can process variable image inputs with low complexity. The experimental results reported in Table 1 show superior performance to the conventional work in terms of resource and power consumption. Fig. 3 and Fig. 4 show the performance comparison with the conventional design of Gabiger-Rose et al.
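The headline percentages quoted in the abstract can be approximately reproduced from the entries of Table 1. The short calculation below is a sketch under stated assumptions: the choice of baseline for each figure is inferred from which table entries reproduce the quoted numbers, not stated in the paper. In particular, the power baseline of 1.26 is the total power listed for Pal et al., because no power figure is reported for [7], and the delay comparison uses the two proposed variants. The helper name `percent_change` is hypothetical.

```python
def percent_change(baseline, proposed):
    """Relative change of `proposed` against `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Logic slices: Gabiger-Rose et al. [7] vs. proposed SSA.
slice_reduction = -percent_change(1060, 724)
# Frame rate at 1024x1024: [7] vs. proposed SSA.
frame_rate_gain = percent_change(52.45, 78.42)
# Maximum path delay: proposed HPA vs. proposed SSA (assumed baseline).
delay_reduction = -percent_change(2.8, 1.4)
# Total power: Pal et al. HPA entry vs. proposed SSA (assumed baseline).
power_reduction = -percent_change(1.26, 0.895)

print(f"slice reduction:  {slice_reduction:.2f} %")
print(f"frame rate gain:  {frame_rate_gain:.2f} %")
print(f"delay reduction:  {delay_reduction:.2f} %")
print(f"power reduction:  {power_reduction:.2f} %")
```

Under these assumptions the four values come out at roughly 31.7 %, 49.5 %, 50 % and 29.0 %, in line with the figures quoted in the abstract up to rounding.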



Fig. 3. Resource allocation in SSA and conventional architecture



Fig. 4. Power consumption in SSA and conventional architecture

Table 2. Image quality assessment (512x512 image)

| σ  | [7] Noisy image PSNR | [7] Noisy image MSSIM | [7] MATLAB PSNR | [7] MATLAB MSSIM | [7] ModelSim PSNR | [7] ModelSim MSSIM | Proposed Noisy image PSNR (HPA) | Proposed Noisy image PSNR (SSA) | Proposed Noisy image MSSIM (HPA) | Proposed Noisy image MSSIM (SSA) | Proposed MATLAB PSNR (HPA) | Proposed MATLAB PSNR (SSA) | Proposed MATLAB MSSIM (HPA) | Proposed MATLAB MSSIM (SSA) | Proposed ModelSim PSNR (HPA) | Proposed ModelSim PSNR (SSA) | Proposed ModelSim MSSIM (HPA) | Proposed ModelSim MSSIM (SSA) |
|----|-------|-------|-------|-------|-------|-------|-------|-------|------|------|------|-------|------|------|------|-------|------|------|
| 10 | 28.12 | 0.74  | 31.41 | 0.86  | 31.00 | 0.85  | 40.12 | 43.12 | 0.42 | 0.21 | 42.3 | 45.42 | 0.53 | 0.36 | 40.6 | 44.21 | 0.37 | 0.37 |
| 20 | 22.19 | 0.51  | 27.23 | 0.74  | 27.00 | 0.71  | 37.10 | 40.12 | 0.33 | 0.18 | 38.0 | 40.12 | 0.41 | 0.27 | 38.2 | 40.15 | 0.26 | 0.26 |
| 30 | 18.80 | 0.37  | 25.00 | 0.62  | 24.70 | 0.59  | 34.33 | 45.12 | 0.27 | 0.10 | 35.1 | 37.13 | 0.32 | 0.18 | 35.1 | 36.33 | 0.19 | 0.19 |
| 40 | 16.47 | 0.28  | 23.41 | 0.53  | 23.75 | 0.51  | 30.10 | 42.6  | 0.16 | 0.08 | 32.4 | 34.06 | 0.26 | 0.10 | 32.4 | 34.71 | 0.11 | 0.11 |
| 50 | 14.75 | 0.23  | 22.19 | 0.47  | 21.93 | 0.44  | 28.2  | 39.3  | 0.10 | 0.04 | 38.7 | 32.2  | 0.18 | 0.04 | 27.7 | 33.2  | 0.06 | 0.06 |
| 60 | 13.43 | 0.19  | 21.17 | 0.41  | 20.93 | 0.39  | 26.3  | 27.2  | 0.04 | 0.01 | 27.3 | 30.60 | 0.10 | 0.01 | 26.9 | 30.33 | 0.02 | 0.02 |

Table 3. Image quality assessment of the proposed work (1024x1024 image)

| Standard deviation (σ) | Noisy image PSNR (HPA) | Noisy image PSNR (SSA) | Noisy image MSSIM (HPA) | Noisy image MSSIM (SSA) | MATLAB PSNR (HPA) | MATLAB PSNR (SSA) | MATLAB MSSIM (HPA) | MATLAB MSSIM (SSA) | ModelSim PSNR (HPA) | ModelSim PSNR (SSA) | ModelSim MSSIM (HPA) | ModelSim MSSIM (SSA) |
|----|-------|-------|------|------|-------|-------|------|------|-------|-------|------|------|
| 10 | 43.6  | 46.62 | 0.61 | 0.42 | 42.11 | 47.41 | 0.66 | 0.5  | 41.33 | 46.22 | 0.64 | 0.48 |
| 20 | 39.6  | 41.32 | 0.43 | 0.36 | 39.73 | 42.63 | 0.53 | 0.42 | 38.66 | 40.18 | 0.53 | 0.41 |
| 30 | 35.31 | 38.61 | 0.33 | 0.27 | 36.87 | 39.19 | 0.41 | 0.21 | 30.18 | 35.33 | 0.48 | 0.37 |
| 40 | 32.87 | 34.33 | 0.24 | 0.18 | 32.16 | 36.27 | 0.27 | 0.1  | 28.33 | 32.12 | 0.36 | 0.24 |
| 50 | 31.83 | 32.61 | 0.19 | 0.10 | 30.06 | 34.13 | 0.13 | 0.04 | 27.65 | 28.61 | 0.18 | 0.12 |
| 60 | 28.61 | 30.33 | 0.10 | 0.04 | 28.99 | 32.16 | 0.05 | 0.01 | 26.91 | 27.77 | 0.10 | 0.04 |







Fig. 6. PSNR vs standard deviation ( $\sigma$ )

#### 4.2 IMAGE QUALITY ASSESSMENT

The image quality assessment parameters PSNR and MSSIM are reported for standard deviation ($\sigma$) values varying from 10 to 60. The mathematical formulations used to estimate the image quality are taken from Gabiger-Rose et al. [7]. The image quality assessment of the proposed work is reported in Table 2 (for 512x512) and Table 3 (for 1024x1024). The results are compared with [7] in Table 2. From the results, we can observe that the PSNR and MSSIM give better results than [7]. Fig. 5 and Fig. 6 show the performance comparison with the conventional design of Gabiger-Rose et al.
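For readers who want to check PSNR values on their own data, the standard definitions of MSE and PSNR can be sketched as follows. This assumes 8-bit images (peak value 255) stored as 2-D lists; the function names `mse` and `psnr` are chosen for this sketch, and the exact formulations used in [7] may differ in detail. MSSIM is not covered here.

```python
import math


def mse(ref, test):
    """Mean squared error between two equally sized 2-D images."""
    n = len(ref) * len(ref[0])
    total = sum((a - b) ** 2
                for ref_row, test_row in zip(ref, test)
                for a, b in zip(ref_row, test_row))
    return total / n


def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(ref, test)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)
```

A higher PSNR corresponds to a lower MSE against the reference image, which is why the two metrics are usually reported together.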

## 5 CONCLUSION

An FPGA based novel approach for the bilateral filter is introduced in this work. The proposed NABI combines a reconfigurable server and a master control unit with the HPA architecture; the same modules are also used with the SSA architecture. Both the HPA and SSA architectures are implemented in a Xilinx Virtex-5VLX50-1. The experimental results are evaluated in terms of FPGA-level and image quality parameters, and the HPA and SSA designs with reconfigurable components are compared with the conventional work. The complete NABI work and its work flow are summarized below:

- Development of new hardware modules for the reconfiguration process
- The reconfiguration modules are the reconfigurable server and the master control unit
- Encapsulation of the HPA and SSA architectures into reconfigurable components
- Introduction of a synchronous clock into the HPA and SSA architectures
- Performance analysis in terms of number of slices, multipliers, maximum path delay and power consumption
- Performance analysis in terms of image quality assessment, such as PSNR and MSSIM
- Comparison of both the HPA and SSA architectures with the conventional work [7]

#### REFERENCES

- [1] Tomasi, Carlo, and Roberto Manduchi. "Bilateral filtering for gray and color images." In *Computer Vision, 1998. Sixth International Conference on*, pp. 839-846. IEEE, 1998.
- [2] Gabiger, Anna, Matthias Kube, and Robert Weigel. "A synchronous FPGA design of a bilateral filter for image processing." In *Industrial Electronics, 2009. IECON'09. 35th Annual Conference of IEEE*, pp. 1990-1995. IEEE, 2009.
- [3] Riesgo, Teresa, Yago Torroja, and Eduardo De la Torre. "Design methodologies based on hardware description languages." *Industrial Electronics, IEEE Transactions on* 46, no. 1 (1999): 3-12.
- [4] Hannig, Frank, Moritz Schmid, Jurgen Teich, and Heinz Hornegger. "A deeply pipelined and parallel architecture for denoising medical images." In *Field-Programmable Technology (FPT), 2010 International Conference on*, pp. 485-490. IEEE, 2010.
- [5] Monmasson, Eric, and Marcian N. Cirstea. "FPGA design methodology for industrial control systems—A review." *Industrial Electronics, IEEE Transactions on* 54, no. 4 (2007): 1824-1842.
- [6] Rodriguez-Andina, Juan J., Maria J. Moure, and Maria D. Valdes. "Features, design tools, and application domains of FPGAs." *Industrial Electronics, IEEE Transactions on* 54, no. 4 (2007): 1810-1823.
- [7] Gabiger-Rose, Anna, Matthias Kube, Robert Weigel, and Rachel Rose. "An FPGA-based fully synchronized design of a bilateral filter for real-time image denoising." *Industrial Electronics, IEEE Transactions on* 61, no. 8 (2014): 4093-4104.
- [8] Charoensak, Charayaphan, and Farook Sattar. "FPGA design of a real-time implementation of dynamic range compression for improving television picture." In Information, Communications & Signal Processing, 2007 6th International Conference on, pp. 1-5. IEEE, 2007.
- [9] Vinh, Truong Quang, Ju Hyun Park, Young-Chul Kim, and Sung Hoon Hong. "FPGA implementation of real-time edge-preserving filter for video noise reduction." In *Computer and Electrical Engineering, 2008. ICCEE 2008. International Conference on*, pp. 611-614. IEEE, 2008.
- [10] Dutta, Hritam, Frank Hannig, Jürgen Teich, Benno Heigl, and Heinz Hornegger. "A design methodology for hardware acceleration of adaptive filter algorithms in image processing." In *Application-specific Systems, Architectures and Processors, 2006. ASAP'06. International Conference on*, pp. 331-340. IEEE, 2006.
- [11] Sakthivel Erulappan, Veluchamy Malathi, and Muruganantham Arunraja. "MATHA: Multiple sense amplifiers with transceiver for high performance improvement in NoC Architecture." Microprocessors and Microsystems 38, no. 7 (2014): 692-706.
- [12] Pal, Chandrajit, Avik Kotal, Asit Samanta, Amlan Chakrabarti, and Ranjan Ghosh. "Design space exploration for image processing architectures on FPGA targets." arXiv preprint arXiv:1404.3877 (2014).
- [13] You, Yu-Li, and Mostafa Kaveh. "Fourth-order partial differential equations for noise removal." Image Processing, IEEE Transactions on 9, no. 10 (2000): 1723-1730.

- [14] Demirtas, Ali Murat, Amy R. Reibman, and Hamid Jafarkhani. "Full-Reference Quality Estimation for Images With Different Spatial Resolutions." Image Processing, IEEE Transactions on 23, no. 5 (2014): 2069-2080.
- [15] [Online]. Available: http://www.mathworks.com
- [16] [Online]. Available: http://www.xilinx.com