A Hardware Accelerated Configurable ASIP Architecture for Embedded Real-Time Video-Based Driver Assistance Applications

Gregor Schewior, Holger Flatt, Carsten Dolar, Christian Banz, and Holger Blume
Leibniz Universität Hannover, Institute of Microelectronic Systems
Appelstr. 4, 30167 Hannover, Germany
{schewior, flatt, dolar, banz, blume}@ims.uni-hannover.de

Abstract—In this paper, a flexible HW architecture for video-based driver assistance applications is presented. It comprises a customizable and extensible processor template and several task-specific HW accelerators. The proposed heterogeneous architecture allows utilization of the programmable processor core for control and low data rate tasks. For the acceleration of computationally intensive tasks of the application, special functional units and custom instructions can be added to the processor template to form an application specific instruction set processor (ASIP). Moreover, dedicated HW accelerators can be attached to the ASIP. To compare the diverse design options, a shape detection application for traffic sign detection is used as a case study. It is shown that single tasks of a pure software implementation can be accelerated significantly, by a factor of up to 35 using special functional units and by a factor of up to 243 using HW accelerators. The proposed architecture has been mapped onto an FPGA, and it could be shown that a real-time capable system can be realized.

I. INTRODUCTION

In recent years, a number of driver assistance features have been introduced to the automobile. One of the major trends is the use of cameras for various tasks, for example lane detection, traffic sign recognition, and obstacle detection. The purpose is to prevent collisions with pedestrians, vehicles, or other objects. Such advanced driver assistance (ADA) systems rely on computationally complex algorithms.
Therefore, it can be foreseen that the simple controller units used today cannot supply sufficient performance. Thus, new concepts and system architectures have to be investigated. Two of the major requirements for ADA hardware architectures are performance and flexibility [1]. For vision-based ADA systems, high throughput in terms of frames per second (fps) and low latency increase the system's ability to react instantaneously to sudden events, e.g. a pedestrian unexpectedly crossing the street. For example, a vehicle moving at 50 km/h advances 14 m in one second. That corresponds to roughly 50 cm between frames at 30 fps. It becomes obvious that the architecture needs to provide high performance to generate maximum throughput, especially for safety-related ADA tasks (e.g. an emergency brake assistant). The need for high flexibility results from the fact that algorithms for ADA systems are still evolving rapidly. To keep up with the algorithmic changes, the hardware architecture should enable the developers to integrate those changes quickly. With a highly flexible system, time-to-market can be kept as short as possible. Taking those aspects into account, two architectural trends can be observed: System-on-Chip solutions using a standard programmable device with associated dedicated HW accelerators [2, 3], or specially designed application specific processors [1]. Since these architectures are typically implemented as integrated circuits, they require a large number of sold devices. For specialized premium ADA systems with a limited number of sold devices, an ASIC implementation might not offer the most cost-efficient solution. Due to their reconfigurability, field programmable gate array (FPGA) devices offer a feasible alternative in such a scenario. Thus, an FPGA-based platform offers the chance to reuse a common HW basis for diverse applications.

978-1-4577-0801-5/11/$26.00 ©2011 IEEE
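The latency figures above follow from a one-line unit conversion; as a quick check (speed and frame rate values taken from the text, function name illustrative):

```python
def metres_per_frame(speed_kmh: float, fps: float) -> float:
    """Distance a vehicle covers between two consecutive camera frames."""
    speed_ms = speed_kmh / 3.6   # km/h -> m/s
    return speed_ms / fps

# 50 km/h corresponds to ~13.9 m travelled per second, i.e. roughly
# half a metre of travel between two frames at 30 fps.
print(round(50 / 3.6, 1), "m/s")
print(round(metres_per_frame(50, 30), 2), "m between frames")
```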
The question that arises, however, is which system architecture mapped onto the FPGA provides short development cycles as well as high performance. It has been shown that dedicated HW architectures offer a significant increase in throughput, e.g. for depth map generation in stereo image processing [4] or the detection of cars in dark surroundings [5]. Considering a well-defined modular system, the design time for dedicated HW accelerators can be kept quite short. It is, however, not feasible to design dedicated HW accelerators for every algorithm of the ADA application. Especially algorithms with a high degree of data dependent control logic can be mapped more efficiently onto a programmable device. An alternative to the use of an off-chip standard processor is the integration of the processor as a soft-core into the FPGA-based HW architecture. This offers the ability to connect the programmable device directly to the infrastructure of the dedicated HW accelerators. It also enables the designer to adapt the processor features to the application. Moreover, new special functional units can be defined for the acceleration of common and application specific tasks. The contribution of this paper is twofold. Firstly, we introduce a heterogeneous HW architecture. The architecture consists of an application specific instruction set processor (ASIP) with attached dedicated hardware accelerators. Secondly, we discuss the design options in the mapping process of ADA applications to a heterogeneous FPGA-based hardware architecture. The ASIP is used for control as well as data processing tasks. It is based on a commercially available configurable and extensible processor template provided by Tensilica [6]. As an exemplary application from the field of video-based driver assistance systems, we use traffic sign recognition.
The different parts of the application have been implemented as dedicated hardware accelerators and on the ASIP with newly generated instruction set extensions. This paper is organized as follows: Section II reviews the related work. In section III the configurable architecture is described in detail. The case study is explained and mapped onto the configurable architecture in section IV. Results are given in section V. Finally, section VI concludes this paper.

Fig. 1. HW accelerated configurable ASIP architecture (block diagram: Xtensa core with TIE extensions, processing elements PE1 to PEn, OCP bus, local memory, SRAM and DDR SDRAM interfaces, and a video I/O connecting camera and display)

II. RELATED WORK

The benefit of application specific processor extensions is discussed e.g. by Beucher et al. [7] and Payá-Vayá et al. [8]. These works demonstrate the capabilities of SW acceleration by designing custom instruction set extensions for block based motion estimation and stereo image processing, respectively. Both approaches utilize a single instance of an enhanced processor core, as is done in this work. With the application specific instruction sets, significant speedups of two orders of magnitude can be reached. Concerning task level parallelization, multi-core systems offer a valuable solution. Khan et al. [9] have designed a multiprocessor system that is based on the configurable NIOS II processor [10]. Although NIOS can be extended with custom instructions, multiple base processor instances have been sufficient in their work. Fontaine et al. [11] combine three extended Tensilica Xtensa processors into a multiprocessor system for 3D target tracking. In order to reduce the hardware overhead of such multiprocessor systems, single processors can be replaced by dedicated hardware accelerators. Claus et al. [12] propose an FPGA-based HW architecture for automotive tasks. This architecture combines the integrated PowerPCs of the Xilinx Virtex-II Pro with attached dynamically reconfigurable HW accelerators. Flatt et al.
[13] introduced a HW architecture that combines a standard RISC processor with an attached FPGA onto which a configurable coprocessor architecture is mapped. Instead of using on- or off-chip hard-core processors, in this paper a configurable and extensible processor is mapped as a soft-core onto the FPGA to form a System-on-Programmable-Chip (SoPC). One of the main advantages is the configurability of the processor interfaces [14]. In So(P)Cs the interface widths are not restricted by the pin count of the package. Thus, one can, for example, use a 128 bit processor bus for high data throughput or attach HW accelerators closely to the processor. However, the maximum frequency of a soft-core processor in an FPGA is limited. To overcome this disadvantage, special functional units and instructions need to be added to the processor in order to speed up the target application.

III. CONFIGURABLE ARCHITECTURE

A. Overview

Fig. 1 shows the concept of the configurable architecture. The Tensilica Xtensa LX2 processor core considered here is a 32-bit architecture to which custom instructions and register files can be added using a Verilog-like description language. The processor is connected to the 64-bit wide multi-layer OCP [15] system bus. The data cache is two-way set-associative (write-back), has a size of 32 kB and a 32 byte line size. It is connected to the processor via a 128 bit interface. The instruction cache size is 16 kB with 64 byte lines. On the one hand, the processor is used for application execution and system control. In order to achieve a fine-grained acceleration of specific functions, custom Tensilica Instruction Extensions (TIE) can be integrated into the processor pipeline. A fast SRAM based local memory, which is used for storing data and instructions, is directly connected to the Xtensa.
On the other hand, a coarse-grained acceleration of computationally intensive algorithms is realized by dedicated processing elements (PE), which are controlled by the Xtensa. They are started by the ASIP via function calls and operate autonomously. In order to support high-speed memory access independent of the addressing scheme, heterogeneous memory types are attached to the system. A fast DDR2 SDRAM is used as main memory. High data rates are supported if the memory is addressed sequentially. In order to support irregular addressing schemes without a performance decrease, an additional SRAM is attached to the system. A dedicated video I/O interface connects the SoPC directly to a camera and display [4]. This interface is directly connected to the DDR2 SDRAM. In general, the system achieves high flexibility due to the configurable ASIP and the growing library of implemented image processing elements. The instruction set extensions of the ASIP, the PEs, and all interfaces can easily be changed and extended according to the requirements of the desired application.

B. ASIP

To achieve a high overall acceleration, one single functional unit (SFU) will not be sufficient; rather, the following three key concepts are necessary, which have been identified in [7]:
• Perform parallel computations. Perform as many computations in parallel as possible (e.g. using SIMD). This implies a more data-greedy algorithm implementation which requires higher bandwidth load/store operations and increased data reuse.
• Maximize data reuse. Reuse data (e.g. intermediate results or once-loaded pixels) as often and as soon as possible so that it can be kept in local fast memory (e.g. the register file). This eliminates the need to load the same data several times or store intermediate results into slow external memory.
• Efficient cache usage. Access data in the external memory in such a way that it can easily be cached. Except for boundary conditions, always use all the data contained in a cache line. Thus, lengthy pipeline stalls due to cache misses are minimized.
To achieve the required overall acceleration, the following distinct enhancements, which incorporate these concepts, have been implemented into the base processor:
• A 16x512 bit register file with specialized load/store instructions.
• Special functional units implementing dedicated data paths required by the particular algorithms.
• A generalized media instruction set for highly parallel data processing using SIMD, which is not restricted to a particular algorithm.
Fig. 2 shows the placement of the three hardware extensions within the pipeline of the processor. All basic media processing instructions support SIMD and generally have several modes in which they can be operated to make them area-efficient and deployable in a wide number of cases.

Fig. 2. Pipeline scheme of the Xtensa LX2 processor for an arithmetic instruction execution showing the new vector register file, SIMD ALU, and special functional units

In addition to the 32-bit wide general purpose register file AR, which is part of the basic processor, a second register file, called vector register file VR, has been implemented. The size of the vector register file is 16x512 bit. It is used to enable massively parallel computations and maximum data reuse throughout the image processing applications. Since the maximum external bus width is 128 bit, special load/store operations have been implemented which allow traditional access schemes as well as schemes that process data across word-alignment boundaries, making stream-based data processing very efficient.

C. Hardware Accelerators

In order to achieve a higher throughput at lower chip resources, hardware accelerators, in the following called processing elements (PE), can be integrated into the system. The design time of the PEs can be reduced if a configurable architecture template is used. This template provides mechanisms for control and external bus transfers, which reduces the design effort for the implementation of new algorithms to the data path.

Fig. 3. Architecture template of a processing element (bus master and slave interfaces, input and output FIFOs, register file, internal memory, control unit, and data path)

Fig. 3 shows a generic architecture template of an autonomous PE, which is based on the work of Flatt et al. [16]. It comprises a slave bus interface, a control unit, a master bus interface for accessing external data, and a data path for performing computations. An internal memory can be integrated into the PE when needed. Performing a computation task requires that the ASIP first transfers a function call to the processing element via the bus slave interface. A function call comprises data memory addresses and defines function specific parameters. Afterwards the PE starts processing; source data is taken from external memories via the bus master interface or directly from internal memories if available. After finishing its computations, the PE sets a status flag in order to receive the next task. The performance benefits mainly from data parallelism in the data path. In contrast to load/store architectures, the PE works as a pipeline. Using input and output FIFO memories, data transfers with external memory and data processing can be performed in parallel.
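The function-call handshake between ASIP and PE can be sketched as a behavioural model (a sketch only: register names, offsets, and the example thresholding kernel are illustrative, not taken from the actual design):

```python
class ProcessingElement:
    """Behavioural model of a PE: the ASIP writes source/destination
    addresses and parameters via the bus slave interface, triggers the PE,
    and polls a status flag that signals readiness for the next task."""

    def __init__(self, kernel):
        self.kernel = kernel          # the PE's fixed data path
        self.regs = {}                # parameter/address registers
        self.busy = False

    def write(self, reg, value):
        """Bus slave interface: one register write of the function call."""
        self.regs[reg] = value

    def start(self, memory):
        """Autonomous run: fetch via bus master, compute, write back."""
        self.busy = True
        src, dst = self.regs["src_addr"], self.regs["dst_addr"]
        n = self.regs["length"]
        memory[dst:dst + n] = self.kernel(memory[src:src + n])
        self.busy = False             # status flag: ready for the next call

# Example: a thresholding PE operating on a flat byte "memory"
mem = list(range(16)) + [0] * 8
pe = ProcessingElement(lambda px: [255 if v >= 8 else 0 for v in px])
pe.write("src_addr", 4)
pe.write("dst_addr", 16)
pe.write("length", 8)
pe.start(mem)
print(mem[16:24])   # -> [0, 0, 0, 0, 255, 255, 255, 255]
```

In the real system the PE runs concurrently with the processor; the sequential `start()` call here only models the observable register-level protocol.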
If the data bus width is larger than the word width of the processed data, multiplexers and demultiplexers have to be integrated into the pipeline. The maximum data parallelism is limited by the data bus width of both the external memory and the system bus. Due to the autonomous operation of the processing elements, several tasks can be executed on different PEs in parallel.

IV. CASE STUDY

A. Application

Video-based traffic sign recognition is one important example of advanced driver assistance systems. Common approaches are usually separated into three main stages [17]. In the first stage, the input image of the current frame is analyzed for traffic sign candidates. Using previous frames, the traffic signs of the current frame are tracked in the second main stage. In the last stage, a final classification of the tracked candidates is performed. Due to the requirement of analyzing the whole frame, the first stage is characterized by a high computational complexity. Therefore, it is suitable as a case study for the evaluation of the proposed architecture. In the following, an object detection chain, which can be used for traffic sign detection, is presented.

Fig. 4. Task graph of shape detection (color splitter CS, histogram generation HG, histogram analysis HA, thresholding Thr, labeling Lab, feature extraction FE, feature filtering FF, and shape classification SC)

The task graph, which has to be executed for the different typical colors of traffic signs (e.g. red, blue, yellow and white), is shown in Fig. 4. Computationally intensive tasks are represented by circles, the remaining algorithms are visualized with rectangles. In order to obtain reliable candidates, colored input images are processed. The first task, the color splitter (CS), separates the colored input image into a single color channel (e.g. red). As described by Wu et al.
[18] and Fu and Huang [17], color separation can be simplified by a preceding color space conversion. Therefore, each 24 bit RGB input pixel is converted to the 24 bit HSV space. Color values of traffic signs are usually confined to small hue intervals h, and they appear with high saturation and brightness values s and v. In order to extract these three features, each component h, s, and v is transformed via a user defined weighting function mh,s,v according to Fig. 5. The results h̃, s̃, and ṽ can be interpreted as component probabilities of a traffic sign color. The final gray value g is computed by multiplication and normalization of h̃, s̃, and ṽ as shown in Eq. 1. Fig. 5 visualizes the splitting process for the red color channel.

Fig. 5. Principle of color splitting

g = (h̃ · s̃ · ṽ) >> 16    (1)

In order to detect traffic sign candidates, the further processing is based on the hardware optimized real-time object detection chain of Flatt et al. [13]. The succeeding tasks segment the color splitted gray value image g and provide a list which contains all traffic sign candidates. To cope with illumination changes, a histogram generation (HG) and a histogram analysis (HA) are performed in order to calculate a single threshold value for the image thresholding (Thr) task. The thresholding task separates background and object pixels. Traffic sign candidates are generated via the connected component labeling (Lab) and shape feature extraction (FE) tasks. According to the object features, a filtering step (FF) discards objects whose sizes differ from those of traffic signs. Similar to [18], the shape of each remaining object is classified with a template matching based approach in the last step (SC). To this end, each object is separated into 16 regions of equal size. Using the connected component labeling result, the number of object pixels in each region is calculated afterwards. A template matching with reference shapes (e.g.
circle, triangle with arrow up/down, rectangle, and rhombus) is used for the final shape classification.

B. Mapping

For an evaluation of the processing performance and area requirements, the single tasks of the shape detection application are mapped onto different architectures: the Tensilica Xtensa LX2 base core, the LX2 base core extended with the VR register file and special functional units, processing elements, and a commercial digital signal processor. To this end, the tasks with high processing requirements are identified and optimized accordingly.

1) Xtensa LX2 base core: For a comparison of the processing performance of SFUs, hardware accelerators, and the DSP, all tasks of the application are implemented for the ASIP base core as C code without the use of extensions. This means that especially designer-defined register files, generic vector instructions, and SFUs are not utilized. Additionally, this implementation serves as the reference for the functional verification of the implementations on the other architectures.

2) Xtensa LX2 base core with extensions: The most computationally intensive algorithms of the shape detection C reference code are accelerated by the use of special functional units. These are the tasks CS, HG, Thr, Lab and FE, which process each pixel of the input image. All SFUs use the VR register file and are thus able to compute their tasks for multiple pixels in parallel. Moreover, multiple pixels are read and written within one instruction by using the special 128 bit load and store operations, saving access time to local and external memory. Table I summarizes the implemented operations. The CS task is accelerated with a complex SFU due to the highly complex computations per pixel. The performed calculations comprise two divisions, ten multiplications, and several addition and comparison operations. Therefore, the COLORSPLIT operation is scheduled for 8 clock cycles.
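A scalar reference of the per-pixel COLORSPLIT computation might look as follows. This is a sketch only: the RGB-to-HSV step is the textbook conversion (here via Python's `colorsys`), the box-shaped weighting functions merely stand in for the user-defined curves of Fig. 5, and the 8-bit scaling of hue and the `s_min`/`v_min` thresholds are assumptions:

```python
import colorsys

def weight(x, lo, hi):
    """Box-shaped stand-in for the weighting curves m_h, m_s, m_v:
    full probability (255) inside [lo, hi], zero outside."""
    return 255 if lo <= x <= hi else 0

def color_split(r, g, b, hue_lo, hue_hi, s_min=80, v_min=80):
    """Map one 24 bit RGB pixel to the 8 bit gray value of Eq. 1:
    g = (h~ * s~ * v~) >> 16."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    h8, s8, v8 = round(h * 255), round(s * 255), round(v * 255)
    ph = weight(h8, hue_lo, hue_hi)   # hue inside the sign's interval?
    ps = weight(s8, s_min, 255)       # high saturation
    pv = weight(v8, v_min, 255)       # high brightness
    return (ph * ps * pv) >> 16       # multiply, renormalize to 8 bit

# A saturated red pixel responds strongly for the red channel (hue near 0),
# while an unsaturated gray pixel is suppressed entirely.
print(color_split(230, 20, 20, hue_lo=0, hue_hi=20))    # -> 253
print(color_split(128, 128, 128, hue_lo=0, hue_hi=20))  # -> 0
```

Note that with three 8 bit probabilities the `>> 16` normalization caps the result at 253 rather than 255; the hardware scaling details are not specified in the text.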
Due to the pipelined execution, the average computation time is one clock cycle per execution. With one execution of the operation, one RGB pixel composed of 24 bit is transformed to the HSV color space. Additionally, according to given thresholds, one 8 bit gray value is computed for the selected color. The 15 parameters required for the calculations are stored in one 256 bit VR register. For the required divisions, a non-restoring 16 bit division function [19] is implemented which completes within 3 clock cycles. The output gray value image has to be stored in external memory for later use by the Thr task. For the HG task, two SFUs are implemented. Due to the random read/write accesses, the resulting histogram is stored in a 1 kB local memory, which allows the memory access operations to execute within one clock cycle. The LOADHIST operation loads the corresponding histogram bin content from the local memory. The load address is computed by adding an offset, namely the pixel value of the color splitted gray value image, to the start address of the histogram. The STOREHIST operation increments the previously loaded histogram value and writes it back to local memory. Each operation processes one pixel per execution and executes in one cycle. For a higher performance, the tasks CS and HG are executed jointly. The results of the CS task are stored in registers of the VR register file and passed to the HG task, saving costly load operations. This is easily implementable due to the control possibilities offered by the processor's programmability. The HA task is performed in software due to its low data rate requirement of 256x32 bit values. An adequate threshold for the Thr task is computed using Otsu's method [20]. The Thr task performs the thresholding on the CS task output image using the threshold computed by the HA task.
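The HG/HA/Thr chain described above reduces to a few lines of scalar reference code (a sketch of the algorithms, not of the SFU implementations; Otsu's method is given in its standard between-class-variance formulation):

```python
def histogram(pixels):
    """HG: 256-bin histogram (the LOADHIST/STOREHIST pair per pixel)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1                  # load bin, increment, store back
    return hist

def otsu_threshold(hist):
    """HA: Otsu's method - threshold maximizing between-class variance."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_bg, sum_bg = 0, -1.0, 0, 0
    for t in range(256):
        w_bg += hist[t]               # background weight up to bin t
        if w_bg in (0, total):
            continue                  # one class empty: variance undefined
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        mu_bg, mu_fg = sum_bg / w_bg, (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def threshold(pixels, t):
    """Thr: binarize (the VR_THRESHOLD SFU does this 64 pixels at a time)."""
    return [255 if p > t else 0 for p in pixels]

# Two clearly separated gray levels: Otsu picks a threshold between them.
img = [10] * 60 + [200] * 40
t = otsu_threshold(histogram(img))
print(t, threshold([10, 200], t))
```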
The VR_THRESHOLD operation processes 64 pixels of the CS task output in parallel by using a 512 bit VR register as input. The binary object pixel output image is stored in local memory for further processing. The Lab task is mapped to four SFUs and is executed twice. During the first run, labels and a correspondence lookup table for the connected labeling, which is stored in local memory, are calculated. During the second run, labels are calculated again and the lookup table is used for the output assignment of the labels. The LABEL_1ST_ROW operation generates labels for the first row of the binary object pixel image. It processes eight pixels in parallel, and for consecutive operation calls intermediate results are stored in dedicated states. The LABEL_OTHER_ROWS operation processes two pixels of the other rows of the image in parallel. The intermediate results are also stored in dedicated states for use in further processing steps. The operation also examines whether a correspondence check has to be performed and writes the label values to be checked into a dedicated state. This state is used by a correspondence function implemented in software for building the lookup table for the connected component labeling. During the second run of the algorithm, instead of the correspondence function the LOAD_LOOKUP operation is performed, which stores the corresponding labels from the lookup table in its output register. An additional function is required to shift the input register by 8 pixels to the left. This is done by the SHIFT_8_LEFT operation. The output of the Lab task is stored in external memory. For the FE task, a fast 128 kB local memory, which is attached to the processor, is used as feature accumulation memory. There is one feature vector for every object in the label image. Each feature vector has a size of 256 bit. It contains information about the object's moments (number of pixels, Σx, Σy, Σx², Σy², Σxy) as well as its bounding box.
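The per-object feature vector just described accumulates simple image moments; a scalar model of the FE accumulation could look like this (field names and the dictionary layout are illustrative, the accumulated quantities are those listed above):

```python
def accumulate_features(labels, width):
    """FE: per-label accumulation of pixel count, sum x, sum y,
    sum x^2, sum y^2, sum x*y, and the bounding box."""
    feats = {}
    for i, lab in enumerate(labels):
        if lab == 0:                  # 0 = background, no valid label
            continue
        x, y = i % width, i // width
        f = feats.setdefault(lab, {"n": 0, "sx": 0, "sy": 0,
                                   "sxx": 0, "syy": 0, "sxy": 0,
                                   "box": [x, y, x, y]})
        f["n"] += 1
        f["sx"] += x
        f["sy"] += y
        f["sxx"] += x * x
        f["syy"] += y * y
        f["sxy"] += x * y
        b = f["box"]                  # [x_min, y_min, x_max, y_max]
        b[0], b[1] = min(b[0], x), min(b[1], y)
        b[2], b[3] = max(b[2], x), max(b[3], y)
    return feats

# 4x2 label image with one object (label 1) in the left 2x2 block
img = [1, 1, 0, 0,
       1, 1, 0, 0]
f = accumulate_features(img, width=4)[1]
print(f["n"], f["box"])               # -> 4 [0, 0, 1, 1]
```

The skip on label 0 mirrors the predication of FE_ACC_FEATURES, which writes back results only for valid labels.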
Whereas moving the feature vectors between the VR register and the local memory is handled by general load/store instructions, the determination and accumulation of the features have been mapped to the special functional unit FE_ACC_FEATURES, which uses one shared multiplier (16 bit) for all multiplications and several adders. For fast iteration, 8 consecutive positions of the label image are loaded into a VR register. To prevent unnecessary calculations, the loaded labels are jointly checked by the VR_EQZ operation for the presence of any valid label. If there is none, the next 8 labels are loaded. Otherwise the 8 labels are processed sequentially. The accumulation instruction uses predication and writes back the results only in the case of a valid label. The tasks Lab and FE are executed jointly because the output operands of the one are the input operands of the other and are kept in a VR register. This speeds up the execution of the FE task due to the omitted load operations for the labels, thus minimizing accesses to external memory.

3) Processing Elements: In order to accelerate the computationally intensive algorithms of the case study, processing elements for the tasks histogram generation (HG), thresholding (Thr), labeling (Lab), and feature extraction (FE) were taken from a growing image processing library [13]. The color splitter (CS) is implemented according to the architecture template of Fig. 3. The data path receives and processes one RGB input pixel in each clock cycle. Based on dedicated multipliers and pipelined dividers, the first step transforms an RGB input pixel to the HSV space. As explained in Fig. 5 and Eq.
1, the gray value is computed afterwards in the second step. The arithmetic functions comprise mostly dedicated multipliers and subtractors. In order to achieve high synthesis frequencies, the whole data path is deeply pipelined.

4) Digital Signal Processor: In order to compare the performance with a commercial processor, a mapping of the case study onto a Texas Instruments OMAP3530 is examined as well. This SoC comprises a 720 MHz ARM Cortex-A8 RISC, which executes the OS, runs control tasks, and performs simple signal processing tasks. The computationally intensive algorithms are carried out by a 520 MHz TMS320C64x+ DSP. The computationally intensive algorithms of the case study are implemented as C code and optimized with intrinsic functions. The implementations of the algorithms HG and Thr are based on the optimized DSP Image/Video Processing Library of Texas Instruments [21].

TABLE I. SPECIAL FUNCTIONAL UNITS, TIE EXTENSIONS

Operation name | Functionality
COLORSPLIT | Complex operation for CS task
LOADHIST | Loading operation for histogram content
STOREHIST | Increment and store operation for histogramming
VR_THRESHOLD | Parallelized thresholding operation for 64 pixels
LABEL_1ST_ROW | Labeling of first row of an image (8 pixels parallel)
LABEL_OTHER_ROWS | Labeling and correspondence check (2 pixels parallel)
SHIFT_8_LEFT | 8 pixels left shift operation required in the Lab task
VR_EQZ | Check if an input vector is non-zero
UPACK_EXTR_X16 | 16 bit extraction from a register
FE_ACC_FEATURES | Complex operation for FE task

TABLE II. SYNTHESIS RESULTS OF THE LX2 AND LX2 + EXTENSIONS FOR A VIRTEX-5 FPGA FOR 100 MHZ TARGET FREQUENCY

Configuration | Resources [LUT] | Diff. to LX2 + VR
LX2 base core | 21,320 | -
LX2 + VR Reg. | 33,050 | -
LX2 + VR Reg. + CS | 45,162 | 12,112
LX2 + VR Reg. + HG | 36,441 | 3,391
LX2 + VR Reg. + Thr | 33,509 | 459
LX2 + VR Reg. + Lab | 33,488 | 438
LX2 + VR Reg. + FE | 36,369 | 3,319
LX2 + VR Reg. + all | 49,821 | 16,771

V. RESULTS
A. Synthesis

To determine the FPGA resource requirements, FPGA synthesis was performed using the Xilinx ISE Design Suite 12.3 for each PE as well as for the basic ASIP template (i.e. the Tensilica Xtensa LX2), the ASIP with the additional VR register file, and the ASIP with the VR register file plus the various SFUs. The target FPGA is a Virtex-5 XC5VLX330 and the target operating frequency is 100 MHz. Table II presents the synthesis results of the ASIP and the ASIP with additional extensions for a Virtex-5 FPGA. It has to be noted that the netlist generated by the Tensilica tools is optimized for ASIC synthesis. Thus, during synthesis, no FPGA specific units such as block RAMs or HW multipliers are used. From the HW implementation point of view, the Tensilica LX2 is therefore not the optimal soft-core processor for FPGA based systems. However, the provided toolflow for processor extensions and SW generation enables fast development cycles. The FPGA resource requirements for the LX2 can consequently only be given in look-up tables [LUT]. Table III presents the synthesis results of the PEs in terms of LUTs, BRAM, and 18 bit HW multipliers.

TABLE III. SYNTHESIS RESULTS OF THE PES FOR A VIRTEX-5 FPGA

PE | LUTs | BRAM [kB] | 18 bit HW multipliers
CS | 1,709 | 16 | 6
HG | 985 | 20 | 0
Thr | 1,029 | 16 | 0
Lab | 1,997 | 36 | 0
FE | 1,929 | 144 | 3

The presented synthesis results are the results before place and route, and no memories or caches were synthesized. In addition to the FPGA synthesis, a synthesis with Synopsys Design Compiler v2010 was performed for the LX2 base core and the LX2 base core with all presented extensions for a high performance process technology from TSMC (90 nm-GT, worst case estimation), again without synthesis of memories and caches, for a target frequency of 373 MHz. The size of the LX2 base core increased from 147,918 to 417,790 in equivalent gate count when adding all extensions. This corresponds to a factor of 2.8. The Tensilica tools also provide gate count numbers for the configured processor and the user-defined extensions, i.e. register files, states, SFUs, etc.
These numbers are approximations and were confirmed by the synthesis; the deviation between both numbers was less than 3%. Compared to the LX2 base core utilizing 21,320 LUTs, there is an increase of up to 49,821 LUTs for the extended cores. This corresponds to a factor of 2.3. The biggest impact comes from the register file and the CS SFU. It has to be remarked that the LUT requirements of the single extensions are not additive: the difference between the LX2 base core with all extensions and the LX2 base core with only the VR register file (16,771 LUTs) is smaller than the sum of the differences measured for each single SFU on top of the VR register file (19,719 LUTs), since the synthesis can share resources between the extensions.

B. Performance

Table IV shows the required execution times in clock cycles per pixel for the tasks CS, HG, CS + HG, Thr, Lab, FE, and Lab + FE, including load and store operations to/from RAM, for the evaluated architectures. In Table V the corresponding speedup factors are given.

TABLE IV. NORMALIZED EXECUTION TIME IN CLOCK CYCLES PER PIXEL OF THE PROCESSING TASKS ON THE EVALUATED ARCHITECTURES

Task | LX2 base core | extended LX2 | HW acc. | DSP
CS | 242.8 | - | 1 | 187.6
HG | 8.44 | - | 1 | 2.2
CS + HG | 251.24 | 7.19* | 2 | 189.8
Thr | 9.4 | 1.06 | 0.079 | 4.1
Lab | 36.04 | - | 1 | 28.3
FE | 23.13 | - | 0.2 | 10.8
Lab + FE | 59.17 | 15.3* | 1.2 | 39.1
Total | 319.81 | 23.55 | 3.279 | 233

(*) The tasks CS and HG, and Lab and FE, can be executed jointly on the extended LX2 to maximize data reuse and minimize access to external memory.

TABLE V. SPEEDUP FACTORS OF THE PROCESSING TASKS COMPARED TO THE LX2 BASE CORE

Task | ASIP + ext. | HW acc. | DSP
CS | - | 242.8 | 1.29
HG | - | 8.44 | 3.8
CS + HG | 34.94* | 125.62 | 1.32
Thr | 8.87 | 118.99 | 2.29
Lab | - | 36.04 | 1.27
FE | - | 115.65 | 2.14
Lab + FE | 3.87* | 49.3 | 1.5
Total | 13.58 | 97.5 | 1.37

(*) The tasks CS and HG, and Lab and FE, can be executed jointly on the extended LX2 to maximize data reuse and minimize access to external memory.

Fig. 6. Exemplary system configurations (A) to (D): the LX2 base core alone, the LX2 with all extensions, the LX2 base core with all PEs, and a heterogeneous combination of a partly extended LX2 with selected PEs
TABLE VI
FPGA RESOURCES AND SYSTEM FREQUENCY FOR EXEMPLARY SYSTEM CONFIGURATIONS

Configuration   FPGA Resources [LUT]   System Frequency @ 25 fps [MHz]
(A)             21,320                 2,370.7
(B)             49,821                 217.4
(C)             28,969                 7.7
(D)             38,570                 85.4

The clock cycle counts for the LX2 base core and the extended ASIP were generated by a cycle-accurate ISS provided by Tensilica and include memory latencies and cache misses. For the DSP, the cycle count of each algorithm was estimated by simulation with Code Composer Studio 3.3 and also includes memory latencies. Equivalent to the proposed system, the comparison is done at VGA resolution. The results show that the PEs significantly outperform all evaluated approaches in terms of throughput: speedups between 8.44 and 242.8 (the latter for the CS task) are reached. In comparison, utilizing the proposed SFUs, speedups between 3.87 and 34.94 are possible. The DSP outperforms the ASIP implementation without TIE extensions with a speedup of 1.27-3.8; on average, however, the TIE instruction set extensions are 10 times faster than the DSP implementations. Due to their high degree of optimization, the PEs outperform the other approaches significantly.

C. Discussion

Using the results presented above, it can now be discussed which mapping should be used for the presented case study. Therefore, four exemplary system configurations are considered (Fig. 6). As input image size, VGA resolution (640x480 pixels) has been chosen. The target throughput is arbitrarily set to 25 fps. Configuration (A) is the reference system, consisting solely of the LX2 base processor. Configuration (B) is the extended LX2 with all SFUs, with the application implemented entirely in SW. Configuration (C) is used if all computational operations are mapped to HW accelerators.
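The frequency column of Table VI can be approximated to first order as cycles per pixel times pixel rate. A sketch in Python; the frame-pipelining model and the 1 cycle/pixel figure for the slowest PE are assumptions inferred from the text, and the table's exact values additionally reflect implementation overheads:

```python
# First-order estimate of the system clock needed to sustain the target
# throughput: frequency = cycles/pixel x pixel rate (VGA at 25 fps).
WIDTH, HEIGHT, FPS = 640, 480, 25
PIXEL_RATE = WIDTH * HEIGHT * FPS  # 7.68 Mpixel/s

def required_mhz(cycles_per_pixel):
    """Clock frequency in MHz needed to process every pixel in time."""
    return cycles_per_pixel * PIXEL_RATE / 1e6

# Configuration (A): all tasks in SW on the LX2 base core (319.81 cy/px),
# on the order of the 2,370.7 MHz listed in Table VI.
print(f"(A): {required_mhz(319.81):.0f} MHz")
# Configuration (C): the slowest PE needs roughly 1 cycle per pixel,
# matching the ~7.7 MHz of Table VI.
print(f"(C): {required_mhz(1.0):.2f} MHz")
```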
Hence, the processor is solely used for control tasks, and in this configuration only the LX2 base core is needed. Configuration (D) is a heterogeneous architecture in which the application is partly mapped onto HW accelerators and partly implemented in SW; here, not all SFUs need to be included in the extended LX2 core. Table VI gives an estimate of the FPGA resources and the system frequency needed to deliver the target throughput of 25 VGA images per second. The processing times per frame result from pipelined processing of the incoming image; the duration of the pipeline stages is determined by the maximum computation time per frame of the PEs and the ASIP. To compare the FPGA resource requirements, only LUTs are used here as a straightforward metric. It is obvious that system (A) cannot be used for the given application, especially not in an FPGA implementation. Using only the ASIP for signal processing (B) is also not applicable for an FPGA implementation, since the processor cannot be synthesized for the required frequency. System (C) has the lowest system frequency; however, it is also the most application-specific HW architecture and has the lowest flexibility. For a modest increase in FPGA resource requirements, system (D) offers a low system frequency while retaining a high degree of flexibility.

VI. CONCLUSIONS

In this paper, a configurable heterogeneous hardware architecture for video-based driver assistance systems is presented. The HW architecture consists of a configurable and extensible processor with attached HW accelerators and has been implemented on an FPGA. As an example application, traffic sign recognition has been used. The different tasks of the application have been implemented as dedicated HW accelerators and as SW on the processor using novel special functional units and instructions. Therefore, the processor is used not only for control but also for data processing tasks.
For estimation of the FPGA resource requirements, different configurations of the HW architecture were synthesized for a Virtex-5 FPGA. As performance metric, the normalized execution times of the different tasks have been evaluated. For the example application discussed here, several implementation variants have been quantitatively compared. The proposed architecture provides several design options and therefore a high degree of flexibility. In early stages of the HW/SW design, the ASIP can be used to implement the reference code. In order to meet the performance constraints, novel special functional units and/or attached hardware accelerators can be added to the architecture. To guarantee a high degree of flexibility in terms of SW programmability, the usage of instruction set extensions should be preferred as long as the target throughput for the application scenario can be met.

ACKNOWLEDGMENT

The work has been funded in parts by the German Federal Ministry of Education and Research (BMBF), No. 13N10718. The authors thank Norman Nolte and Sebastian Flügel at ProDesign Electronic GmbH and Andreas Tarnowsky for their contributions to this work.