A Hardware Accelerated Configurable ASIP Architecture for Embedded Real-Time Video-Based Driver Assistance Applications

Gregor Schewior, Holger Flatt, Carsten Dolar, Christian Banz, and Holger Blume
Leibniz Universität Hannover, Institute of Microelectronic Systems
Appelstr. 4, 30167 Hannover, Germany
{schewior, flatt, dolar, banz, blume}@ims.uni-hannover.de

Abstract—In this paper, a flexible HW architecture for video-based driver assistance applications is presented. It comprises a customizable and extensible processor template and several task-specific HW accelerators. The proposed heterogeneous architecture allows utilization of the programmable processor core for control and low data rate tasks. For the acceleration of computationally intensive tasks of the application, special functional units and custom instructions can be added to the processor template to form an application specific instruction set processor (ASIP). Moreover, dedicated HW accelerators can be attached to the ASIP. To compare the diverse design options, a shape detection application for traffic sign detection is used as a case study. It is shown that single tasks of a pure software implementation can be accelerated significantly, by a factor of up to 35 using special functional units and by a factor of up to 243 using HW accelerators. The proposed architecture has been mapped onto an FPGA, and it could be shown that a real-time capable system can be realized.

I. INTRODUCTION

In recent years, a number of driver assistance features have been introduced to the automobile. One of the major trends is the use of cameras for various tasks, for example lane detection, traffic sign recognition, and obstacle detection. The purpose is to prevent collisions with pedestrians, vehicles, or other objects. Such advanced driver assistance (ADA) systems rely on computationally complex algorithms.
Therefore, it can be foreseen that the simple controller units used today cannot supply sufficient performance. Thus, new concepts and system architectures have to be investigated. Two of the major requirements for ADA hardware architectures are performance and flexibility [1]. For vision-based ADA systems, high throughput in terms of frames per second (fps) and low latency increase the system's ability to react instantaneously to sudden events, e.g. a pedestrian unexpectedly crossing the street. For example, a vehicle moving at 50 km/h advances 14 m in one second. That corresponds to roughly 50 cm between frames at 30 fps. It becomes obvious that the architecture needs to provide high performance to generate maximum throughput, especially for safety-related ADA tasks (e.g. an emergency brake assistant). The need for high flexibility results from the fact that algorithms for ADA systems are still evolving rapidly. To keep up with the algorithmic changes, the hardware architecture should enable the developers to integrate those changes quickly. With a highly flexible system, time-to-market can be kept as short as possible. Taking those aspects into account, two architectural trends can be observed: System-on-Chip solutions using a standard programmable device with associated dedicated HW accelerators [2, 3], or specially designed application specific processors [1]. Since these architectures are typically implemented as integrated circuits, they require a large number of sold devices. For specialized premium ADA systems with a limited number of sold devices, an ASIC implementation might not offer the most cost-efficient solution. Due to their reconfigurability, field programmable gate array (FPGA) devices offer a feasible alternative in such a scenario. Thus, an FPGA-based platform offers the chance to reuse a common HW basis for diverse applications.

978-1-4577-0801-5/11/$26.00 ©2011 IEEE
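The latency figures above follow from a one-line unit conversion; as a quick check (speed and frame rate values taken from the text, function name illustrative):

```python
def metres_per_frame(speed_kmh: float, fps: float) -> float:
    """Distance a vehicle covers between two consecutive camera frames."""
    speed_ms = speed_kmh / 3.6   # km/h -> m/s
    return speed_ms / fps

# 50 km/h corresponds to ~13.9 m travelled per second, i.e. roughly
# half a metre of travel between two frames at 30 fps.
print(round(50 / 3.6, 1), "m/s")
print(round(metres_per_frame(50, 30), 2), "m between frames")
```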
The question that arises, however, is which system architecture mapped onto the FPGA provides short development cycles as well as high performance. It has been shown that dedicated HW architectures offer a significant increase in throughput, e.g. for depth map generation in stereo image processing [4] or the detection of cars in dark surroundings [5]. Considering a well-defined modular system, the design time for dedicated HW accelerators can be kept quite short. It is, however, not feasible to design dedicated HW accelerators for every algorithm of the ADA application. Especially algorithms with a high degree of data dependent control logic can be mapped more efficiently onto a programmable device. An alternative to the use of an off-chip standard processor is the integration of the processor as a soft-core into the FPGA-based HW architecture. This offers the ability to connect the programmable device directly to the infrastructure of the dedicated HW accelerators. It also enables the designer to adapt the processor features to the application. Moreover, new special functional units can be defined for the acceleration of common and application specific tasks. The contribution of this paper is twofold. Firstly, we introduce a heterogeneous HW architecture. The architecture consists of an application specific instruction set processor (ASIP) with attached dedicated hardware accelerators. Secondly, we discuss the design options in the mapping process of ADA applications to a heterogeneous FPGA-based hardware architecture. The ASIP is used for control as well as data processing tasks. It is based on a commercially available configurable and extensible processor template provided by Tensilica [6]. As an exemplary application from the field of video-based driver assistance systems, we use traffic sign recognition.
The different parts of the application have been implemented as dedicated hardware accelerators and on the ASIP with newly generated instruction set extensions. This paper is organized as follows: Section II reviews the related work. In section III the configurable architecture is described in detail. The case study is explained and mapped onto the configurable architecture in section IV. Results are given in section V. Finally, section VI concludes this paper.

Fig. 1. HW accelerated configurable ASIP architecture (block diagram: Xtensa core with TIE extensions, processing elements PE1 to PEn, OCP bus, local memory, SRAM and DDR SDRAM interfaces, and a video I/O connecting camera and display)

II. RELATED WORK

The benefit of application specific processor extensions is discussed e.g. by Beucher et al. [7] and Payá-Vayá et al. [8]. These works demonstrate the capabilities of SW acceleration by designing custom instruction set extensions for block based motion estimation and stereo image processing, respectively. Both approaches utilize a single instance of an enhanced processor core, as is done in this work. With the application specific instruction sets, significant speedups of two orders of magnitude can be reached. Concerning task level parallelization, multi-core systems offer a valuable solution. Khan et al. [9] have designed a multiprocessor system that is based on the configurable NIOS II processor [10]. Although NIOS can be extended with custom instructions, multiple base processor instances have been sufficient in their work. Fontaine et al. [11] combine three extended Tensilica Xtensa processors into a multiprocessor system for 3D target tracking. In order to reduce the hardware overhead of such multiprocessor systems, single processors can be replaced by dedicated hardware accelerators. Claus et al. [12] propose an FPGA-based HW architecture for automotive tasks. This architecture combines the integrated PowerPCs of the Xilinx Virtex-II Pro with attached dynamically reconfigurable HW accelerators. Flatt et al.
[13] introduced a HW architecture that combines a standard RISC processor with an attached FPGA onto which a configurable coprocessor architecture is mapped. Instead of using on- or off-chip hard-core processors, in this paper a configurable and extensible processor is mapped as a soft-core onto the FPGA to form a System-on-Programmable-Chip (SoPC). One of the main advantages is the configurability of the processor interfaces [14]. In So(P)Cs the interface widths are not restricted by the pin count of the package. Thus, one can, for example, use a 128 bit processor bus for high data throughput or attach HW accelerators closely to the processor. However, the maximum frequency of a soft-core processor in an FPGA is limited. To overcome this disadvantage, special functional units and instructions need to be added to the processor in order to speed up the target application.

III. CONFIGURABLE ARCHITECTURE

A. Overview

Fig. 1 shows the concept of the configurable architecture. The Tensilica Xtensa LX2 processor core considered here is a 32-bit architecture to which custom instructions and register files can be added using a Verilog-like description language. The processor is connected to the 64-bit wide multi-layer OCP [15] system bus. The data cache is two-way set-associative (write-back), has a size of 32 kB and a 32 byte line size. It is connected to the processor via a 128 bit interface. The instruction cache size is 16 kB with 64 byte lines. On the one hand, the processor is used for application execution and system control. In order to achieve a fine-grained acceleration of specific functions, custom Tensilica Instruction Extensions (TIE) can be integrated into the processor pipeline. A fast SRAM based local memory, which is used for storing data and instructions, is directly connected to the Xtensa.
On the other hand, a coarse-grained acceleration of computationally intensive algorithms is realized by dedicated processing elements (PE), which are controlled by the Xtensa. They are started by the ASIP via function calls and operate autonomously. In order to support high-speed memory access independent of the addressing scheme, heterogeneous memory types are attached to the system. A fast DDR2 SDRAM is used as main memory. High data rates are supported if the memory is addressed sequentially. In order to support irregular addressing schemes without a performance decrease, an additional SRAM is attached to the system. A dedicated video I/O interface connects the SoPC directly to a camera and display [4]. This interface is directly connected to the DDR2 SDRAM. In general, the system achieves high flexibility due to the configurable ASIP and the growing library of implemented image processing elements. The instruction set extensions of the ASIP, the PEs, and all interfaces can easily be changed and extended according to the requirements of the desired application.

B. ASIP

To achieve a high overall acceleration, one single functional unit (SFU) will not be sufficient; rather, the following three key concepts are necessary, which have been identified in [7]:
• Perform parallel computations. Perform as many computations in parallel as possible (e.g. using SIMD). This implies a more data-greedy algorithm implementation which requires higher bandwidth load/store operations and increased data reuse.
• Maximize data reuse. Reuse data (e.g. intermediate results or once-loaded pixels) as often and as soon as possible so that it can be kept in local fast memory (e.g. the register file). This eliminates the need to load the same data several times or store intermediate results into slow external memory.
• Efficient cache usage. Access data in the external memory in such a way that it can easily be cached. Except for boundary conditions, always use all the data contained in a cache line. Thus, lengthy pipeline stalls due to cache misses are minimized.
To achieve the required overall acceleration, the following distinct enhancements, which incorporate these concepts, have been implemented into the base processor:
• A 16x512 bit register file with specialized load/store instructions.
• Special functional units implementing dedicated data paths required by the particular algorithms.
• A generalized media instruction set for highly parallel data processing using SIMD, which is not restricted to a particular algorithm.
Fig. 2 shows the placement of the three hardware extensions within the pipeline of the processor. All basic media processing instructions support SIMD and generally have several modes in which they can be operated to make them area-efficient and deployable in a wide number of cases.

Fig. 2. Pipeline scheme of the Xtensa LX2 processor for an arithmetic instruction execution showing the new vector register file, SIMD ALU, and special functional units

In addition to the 32-bit wide general purpose register file AR, which is part of the basic processor, a second register file, called vector register file VR, has been implemented. The size of the vector register file is 16x512 bit. It is used to enable massively parallel computations and maximum data reuse throughout the image processing applications. Since the maximum external bus width is 128 bit, special load/store operations have been implemented which allow traditional access schemes as well as schemes that process data across word-alignment boundaries, making stream-based data processing very efficient.

C. Hardware Accelerators

In order to achieve a higher throughput at lower chip resources, hardware accelerators, in the following called processing elements (PE), can be integrated into the system. The design time of the PEs can be reduced if a configurable architecture template is used. This template provides mechanisms for control and external bus transfers, which reduces the design effort for the implementation of new algorithms to the data path.

Fig. 3. Architecture template of a processing element (bus master and slave interfaces, input and output FIFOs, register file, internal memory, control unit, and data path)

Fig. 3 shows a generic architecture template of an autonomous PE, which is based on the work of Flatt et al. [16]. It comprises a slave bus interface, a control unit, a master bus interface for accessing external data, and a data path for performing computations. An internal memory can be integrated into the PE when needed. Performing a computation task requires that the ASIP first transfers a function call to the processing element via the bus slave interface. A function call comprises data memory addresses and defines function specific parameters. Afterwards the PE starts processing; source data is taken from external memories via the bus master interface or directly from internal memories if available. After finishing its computations, the PE sets a status flag in order to receive the next task. The performance benefits mainly from data parallelism in the data path. In contrast to load/store architectures, the PE works as a pipeline. Using input and output FIFO memories, data transfers with external memory and data processing can be performed in parallel.
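The function-call handshake between ASIP and PE can be sketched as a behavioural model (a sketch only: register names, offsets, and the example thresholding kernel are illustrative, not taken from the actual design):

```python
class ProcessingElement:
    """Behavioural model of a PE: the ASIP writes source/destination
    addresses and parameters via the bus slave interface, triggers the PE,
    and polls a status flag that signals readiness for the next task."""

    def __init__(self, kernel):
        self.kernel = kernel          # the PE's fixed data path
        self.regs = {}                # parameter/address registers
        self.busy = False

    def write(self, reg, value):
        """Bus slave interface: one register write of the function call."""
        self.regs[reg] = value

    def start(self, memory):
        """Autonomous run: fetch via bus master, compute, write back."""
        self.busy = True
        src, dst = self.regs["src_addr"], self.regs["dst_addr"]
        n = self.regs["length"]
        memory[dst:dst + n] = self.kernel(memory[src:src + n])
        self.busy = False             # status flag: ready for the next call

# Example: a thresholding PE operating on a flat byte "memory"
mem = list(range(16)) + [0] * 8
pe = ProcessingElement(lambda px: [255 if v >= 8 else 0 for v in px])
pe.write("src_addr", 4)
pe.write("dst_addr", 16)
pe.write("length", 8)
pe.start(mem)
print(mem[16:24])   # -> [0, 0, 0, 0, 255, 255, 255, 255]
```

In the real system the PE runs concurrently with the processor; the sequential `start()` call here only models the observable register-level protocol.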
If the data bus width is larger than the word width of the processed data, multiplexers and demultiplexers have to be integrated into the pipeline. The maximum data parallelism is limited by the data bus width of both the external memory and the system bus. Due to the autonomous operation of the processing elements, several tasks can be executed on different PEs in parallel.

IV. CASE STUDY

A. Application

Video-based traffic sign recognition is one important example of advanced driver assistance systems. Common approaches are usually separated into three main stages [17]. In the first stage, the input image of the current frame is analyzed for traffic sign candidates. Using previous frames, the traffic signs of the current frame are tracked in the second main stage. In the last stage, a final classification of the tracked candidates is performed. Due to the requirement of analyzing the whole frame, the first stage is characterized by a high computational complexity. Therefore, it is suitable as a case study for the evaluation of the proposed architecture. In the following, an object detection chain, which can be used for traffic sign detection, is presented.

Fig. 4. Task graph of shape detection (color splitter CS, histogram generation HG, histogram analysis HA, thresholding Thr, labeling Lab, feature extraction FE, feature filtering FF, and shape classification SC)

The task graph, which has to be executed for the different typical colors of traffic signs (e.g. red, blue, yellow and white), is shown in Fig. 4. Computationally intensive tasks are represented by circles, the remaining algorithms are visualized with rectangles. In order to obtain reliable candidates, colored input images are processed. The first task, the color splitter (CS), separates the colored input image into a single color channel (e.g. red). As described by Wu et al.
[18] and Fu and Huang [17], color separation can be simplified by a preceding color space conversion. Therefore, each 24 bit RGB input pixel is converted to the 24 bit HSV space. Color values of traffic signs are usually confined to small hue intervals h, and they appear with high saturation and brightness values s and v. In order to extract these three features, each component h, s, and v is transformed via a user defined weighting function mh,s,v according to Fig. 5. The results h̃, s̃, and ṽ can be interpreted as component probabilities of a traffic sign color. The final gray value g is computed by multiplication and normalization of h̃, s̃, and ṽ as shown in Eq. 1. Fig. 5 visualizes the splitting process for the red color channel.

Fig. 5. Principle of color splitting

g = (h̃ · s̃ · ṽ) >> 16    (1)

In order to detect traffic sign candidates, the further processing is based on the hardware optimized real-time object detection chain of Flatt et al. [13]. The succeeding tasks segment the color splitted gray value image g and provide a list which contains all traffic sign candidates. To cope with illumination changes, a histogram generation (HG) and a histogram analysis (HA) are performed in order to calculate a single threshold value for the image thresholding (Thr) task. The thresholding task separates background and object pixels. Traffic sign candidates are generated via the connected component labeling (Lab) and shape feature extraction (FE) tasks. According to the object features, a filtering step (FF) discards objects whose sizes differ from those of traffic signs. Similar to [18], the shape of each remaining object is classified with a template matching based approach in the last step (SC). To this end, each object is separated into 16 regions of equal size. Using the connected component labeling result, the number of object pixels in each region is calculated afterwards. A template matching with reference shapes (e.g.
circle, triangle with arrow up/down, rectangle, and rhombus) is used for the final shape classification.

B. Mapping

For an evaluation of the processing performance and area requirements, the single tasks of the shape detection application are mapped onto different architectures: the Tensilica Xtensa LX2 base core, the LX2 base core extended with the VR register file and special functional units, processing elements, and a commercial digital signal processor. To this end, the tasks with high processing requirements are identified and optimized accordingly.

1) Xtensa LX2 base core: For a comparison of the processing performance of SFUs, hardware accelerators, and the DSP, all tasks of the application are implemented for the ASIP base core as C code without the use of extensions. This means that especially designer-defined register files, generic vector instructions, and SFUs are not utilized. Additionally, this implementation serves as the reference for the functional verification of the implementations on the other architectures.

2) Xtensa LX2 base core with extensions: The most computationally intensive algorithms of the shape detection C reference code are accelerated by the use of special functional units. These are the tasks CS, HG, Thr, Lab and FE, which process each pixel of the input image. All SFUs use the VR register file and are thus able to compute their tasks for multiple pixels in parallel. Moreover, multiple pixels are read and written within one instruction by using the special 128 bit load and store operations, saving access time to local and external memory. Table I summarizes the implemented operations. The CS task is accelerated with a complex SFU due to the highly complex computations per pixel. The performed calculations comprise two divisions, ten multiplications, and several addition and comparison operations. Therefore, the COLORSPLIT operation is scheduled for 8 clock cycles.
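A scalar reference of the per-pixel COLORSPLIT computation might look as follows. This is a sketch only: the RGB-to-HSV step is the textbook conversion (here via Python's `colorsys`), the box-shaped weighting functions merely stand in for the user-defined curves of Fig. 5, and the 8-bit scaling of hue and the `s_min`/`v_min` thresholds are assumptions:

```python
import colorsys

def weight(x, lo, hi):
    """Box-shaped stand-in for the weighting curves m_h, m_s, m_v:
    full probability (255) inside [lo, hi], zero outside."""
    return 255 if lo <= x <= hi else 0

def color_split(r, g, b, hue_lo, hue_hi, s_min=80, v_min=80):
    """Map one 24 bit RGB pixel to the 8 bit gray value of Eq. 1:
    g = (h~ * s~ * v~) >> 16."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    h8, s8, v8 = round(h * 255), round(s * 255), round(v * 255)
    ph = weight(h8, hue_lo, hue_hi)   # hue inside the sign's interval?
    ps = weight(s8, s_min, 255)       # high saturation
    pv = weight(v8, v_min, 255)       # high brightness
    return (ph * ps * pv) >> 16       # multiply, renormalize to 8 bit

# A saturated red pixel responds strongly for the red channel (hue near 0),
# while an unsaturated gray pixel is suppressed entirely.
print(color_split(230, 20, 20, hue_lo=0, hue_hi=20))    # -> 253
print(color_split(128, 128, 128, hue_lo=0, hue_hi=20))  # -> 0
```

Note that with three 8 bit probabilities the `>> 16` normalization caps the result at 253 rather than 255; the hardware scaling details are not specified in the text.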
Due to the pipelined execution, the average computation time is one clock cycle per execution. With one execution of the operation, one RGB pixel composed of 24 bit is transformed to the HSV color space. Additionally, according to given thresholds, one 8 bit gray value is computed for the selected color. The 15 parameters required for the calculations are stored in one 256 bit VR register. For the required divisions, a non-restoring 16 bit division function [19] is implemented which completes within 3 clock cycles. The output gray value image has to be stored in external memory for later use by the Thr task. For the HG task, two SFUs are implemented. Due to the random read/write accesses, the resulting histogram is stored in a 1 kB local memory, which allows the memory access operations to execute within one clock cycle. The LOADHIST operation loads the corresponding histogram bin content from the local memory. The load address is computed by adding an offset, namely the pixel value of the color splitted gray value image, to the start address of the histogram. The STOREHIST operation increments the previously loaded histogram value and writes it back to local memory. Each operation processes one pixel per execution and executes in one cycle. For a higher performance, the tasks CS and HG are executed jointly. The results of the CS task are stored in registers of the VR register file and passed to the HG task, saving costly load operations. This is easily implementable due to the control possibilities offered by the processor's programmability. The HA task is performed in software due to its low data rate requirement of 256x32 bit values. An adequate threshold for the Thr task is computed using Otsu's method [20]. The Thr task performs the thresholding on the CS task output image using the threshold computed by the HA task.
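The HG/HA/Thr chain described above reduces to a few lines of scalar reference code (a sketch of the algorithms, not of the SFU implementations; Otsu's method is given in its standard between-class-variance formulation):

```python
def histogram(pixels):
    """HG: 256-bin histogram (the LOADHIST/STOREHIST pair per pixel)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1                  # load bin, increment, store back
    return hist

def otsu_threshold(hist):
    """HA: Otsu's method - threshold maximizing between-class variance."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_bg, sum_bg = 0, -1.0, 0, 0
    for t in range(256):
        w_bg += hist[t]               # background weight up to bin t
        if w_bg in (0, total):
            continue                  # one class empty: variance undefined
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        mu_bg, mu_fg = sum_bg / w_bg, (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def threshold(pixels, t):
    """Thr: binarize (the VR_THRESHOLD SFU does this 64 pixels at a time)."""
    return [255 if p > t else 0 for p in pixels]

# Two clearly separated gray levels: Otsu picks a threshold between them.
img = [10] * 60 + [200] * 40
t = otsu_threshold(histogram(img))
print(t, threshold([10, 200], t))
```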
The VR_THRESHOLD operation processes 64 pixels of the CS task output in parallel by using a 512 bit VR register as input. The binary object pixel output image is stored in local memory for further processing. The Lab task is mapped to four SFUs and is executed twice. During the first run, labels and a correspondence lookup table for the connected labeling, which is stored in local memory, are calculated. During the second run, labels are calculated again and the lookup table is used for the output assignment of the labels. The LABEL_1ST_ROW operation generates labels for the first row of the binary object pixel image. It processes eight pixels in parallel, and for consecutive operation calls intermediate results are stored in dedicated states. The LABEL_OTHER_ROWS operation processes two pixels of the other rows of the image in parallel. The intermediate results are also stored in dedicated states for use in further processing steps. The operation also examines whether a correspondence check has to be performed and writes the label values to be checked into a dedicated state. This state is used by a correspondence function implemented in software for building the lookup table for the connected component labeling. During the second run of the algorithm, instead of the correspondence function the LOAD_LOOKUP operation is performed, which stores the corresponding labels from the lookup table in its output register. An additional function is required to shift the input register by 8 pixels to the left. This is done by the SHIFT_8_LEFT operation. The output of the Lab task is stored in external memory. For the FE task, a fast 128 kB local memory, which is attached to the processor, is used as feature accumulation memory. There is one feature vector for every object in the label image. Each feature vector has a size of 256 bit. It contains information about the object's moments (number of pixels, Σx, Σy, Σx², Σy², Σxy) as well as its bounding box.
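The per-object feature vector just described accumulates simple image moments; a scalar model of the FE accumulation could look like this (field names and the dictionary layout are illustrative, the accumulated quantities are those listed above):

```python
def accumulate_features(labels, width):
    """FE: per-label accumulation of pixel count, sum x, sum y,
    sum x^2, sum y^2, sum x*y, and the bounding box."""
    feats = {}
    for i, lab in enumerate(labels):
        if lab == 0:                  # 0 = background, no valid label
            continue
        x, y = i % width, i // width
        f = feats.setdefault(lab, {"n": 0, "sx": 0, "sy": 0,
                                   "sxx": 0, "syy": 0, "sxy": 0,
                                   "box": [x, y, x, y]})
        f["n"] += 1
        f["sx"] += x
        f["sy"] += y
        f["sxx"] += x * x
        f["syy"] += y * y
        f["sxy"] += x * y
        b = f["box"]                  # [x_min, y_min, x_max, y_max]
        b[0], b[1] = min(b[0], x), min(b[1], y)
        b[2], b[3] = max(b[2], x), max(b[3], y)
    return feats

# 4x2 label image with one object (label 1) in the left 2x2 block
img = [1, 1, 0, 0,
       1, 1, 0, 0]
f = accumulate_features(img, width=4)[1]
print(f["n"], f["box"])               # -> 4 [0, 0, 1, 1]
```

The skip on label 0 mirrors the predication of FE_ACC_FEATURES, which writes back results only for valid labels.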
Whereas moving the feature vectors between the VR register and the local memory is handled by general load/store instructions, the determination and accumulation of the features have been mapped to the special functional unit FE_ACC_FEATURES, which uses one shared multiplier (16 bit) for all multiplications and several adders. For fast iteration, 8 consecutive positions of the label image are loaded into a VR register. To prevent unnecessary calculations, the loaded labels are jointly checked by the VR_EQZ operation for the presence of any valid label. If there is none, the next 8 labels are loaded. Otherwise the 8 labels are processed sequentially. The accumulation instruction uses predication and writes back the results only in the case of a valid label. The tasks Lab and FE are executed jointly because the output operands of the one are the input operands of the other and are kept in a VR register. This speeds up the execution of the FE task due to the omitted load operations for the labels, thus minimizing accesses to external memory.

3) Processing Elements: In order to accelerate the computationally intensive algorithms of the case study, processing elements for the tasks histogram generation (HG), thresholding (Thr), labeling (Lab), and feature extraction (FE) were taken from a growing image processing library [13]. The color splitter (CS) is implemented according to the architecture template of Fig. 3. The data path receives and processes one RGB input pixel in each clock cycle. Based on dedicated multipliers and pipelined dividers, the first step transforms an RGB input pixel to the HSV space. As explained in Fig. 5 and Eq.
1, the gray value is computed afterwards in the second step. The arithmetic functions comprise mostly dedicated multipliers and subtractors. In order to achieve high synthesis frequencies, the whole data path is deeply pipelined.

4) Digital Signal Processor: In order to compare the performance with a commercial processor, a mapping of the case study onto a Texas Instruments OMAP3530 is examined as well. This SoC comprises a 720 MHz ARM Cortex-A8 RISC, which executes the OS, runs control tasks, and performs simple signal processing tasks. The computationally intensive algorithms are carried out by a 520 MHz TMS320C64x+ DSP. The computationally intensive algorithms of the case study are implemented as C code and optimized with intrinsic functions. The implementations of the algorithms HG and Thr are based on the optimized DSP Image/Video Processing Library of Texas Instruments [21].

TABLE I. SPECIAL FUNCTIONAL UNITS, TIE EXTENSIONS

Operation name | Functionality
COLORSPLIT | Complex operation for CS task
LOADHIST | Loading operation for histogram content
STOREHIST | Increment and store operation for histogramming
VR_THRESHOLD | Parallelized thresholding operation for 64 pixels
LABEL_1ST_ROW | Labeling of first row of an image (8 pixels parallel)
LABEL_OTHER_ROWS | Labeling and correspondence check (2 pixels parallel)
SHIFT_8_LEFT | 8 pixels left shift operation required in the Lab task
VR_EQZ | Check if an input vector is non-zero
UPACK_EXTR_X16 | 16 bit extraction from a register
FE_ACC_FEATURES | Complex operation for FE task

TABLE II. SYNTHESIS RESULTS OF THE LX2 AND LX2 + EXTENSIONS FOR A VIRTEX-5 FPGA FOR 100 MHZ TARGET FREQUENCY

Configuration | Resources [LUT] | Diff. to LX2 + VR
LX2 base core | 21,320 | -
LX2 + VR Reg. | 33,050 | -
LX2 + VR Reg. + CS | 45,162 | 12,112
LX2 + VR Reg. + HG | 36,441 | 3,391
LX2 + VR Reg. + Thr | 33,509 | 459
LX2 + VR Reg. + Lab | 33,488 | 438
LX2 + VR Reg. + FE | 36,369 | 3,319
LX2 + VR Reg. + all | 49,821 | 16,771

V. RESULTS
A. Synthesis

To determine the FPGA resource requirements, FPGA synthesis was performed using the Xilinx ISE Design Suite 12.3 for each PE as well as for the basic ASIP template (i.e. the Tensilica Xtensa LX2), the ASIP with the additional VR register file, and the ASIP with the VR register file plus the various SFUs. The target FPGA is a Virtex-5 XC5VLX330 and the target operating frequency is 100 MHz. Table II presents the synthesis results of the ASIP and the ASIP with additional extensions for a Virtex-5 FPGA. It has to be noted that the netlist generated by the Tensilica tools is optimized for ASIC synthesis. Thus, during synthesis, no FPGA specific units such as block RAMs or HW multipliers are used. From the HW implementation point of view, the Tensilica LX2 is therefore not the optimal soft-core processor for FPGA based systems. However, the provided toolflow for processor extensions and SW generation enables fast development cycles. The FPGA resource requirements for the LX2 can consequently only be given in look-up tables [LUT]. Table III presents the synthesis results of the PEs in terms of LUTs, BRAM, and 18 bit HW multipliers.

TABLE III. SYNTHESIS RESULTS OF THE PES FOR A VIRTEX-5 FPGA

PE | LUTs | BRAM [kB] | 18 bit HW multipliers
CS | 1,709 | 16 | 6
HG | 985 | 20 | 0
Thr | 1,029 | 16 | 0
Lab | 1,997 | 36 | 0
FE | 1,929 | 144 | 3

The presented synthesis results are the results before place and route, and no memories or caches were synthesized. In addition to the FPGA synthesis, a synthesis with Synopsys Design Compiler v2010 was performed for the LX2 base core and the LX2 base core with all presented extensions for a high performance process technology from TSMC (90 nm-GT, worst case estimation), again without synthesis of memories and caches, for a target frequency of 373 MHz. The size of the LX2 base core increased from 147,918 to 417,790 in equivalent gate count when adding all extensions. This corresponds to a factor of 2.8. The Tensilica tools also provide gate count numbers for the configured processor and the user-defined extensions, i.e. register files, states, SFUs, etc.
These numbers are approximations and were confirmed by the synthesis; the deviation between both numbers was less than 3%. Compared to the LX2 base core utilizing 21,320 LUTs, there is an increase of up to 49,821 LUTs for the extended cores. This corresponds to a factor of 2.3. The biggest impact comes from the register file and the CS SFU. It has to be remarked that the LUT requirements of the single extensions are not additive: the difference between the LX2 base core with all extensions and the LX2 base core with only the VR register file (16,771 LUTs) is smaller than the sum of the differences measured for each single SFU on top of the VR register file (19,719 LUTs), since the synthesis can share resources between the extensions.

B. Performance

Table IV shows the required execution times in clock cycles per pixel for the tasks CS, HG, CS + HG, Thr, Lab, FE, and Lab + FE, including load and store operations to/from RAM, for the evaluated architectures. In Table V the corresponding speedup factors are given.

TABLE IV. NORMALIZED EXECUTION TIME IN CLOCK CYCLES PER PIXEL OF THE PROCESSING TASKS ON THE EVALUATED ARCHITECTURES

Task | LX2 base core | extended LX2 | HW acc. | DSP
CS | 242.8 | - | 1 | 187.6
HG | 8.44 | - | 1 | 2.2
CS + HG | 251.24 | 7.19* | 2 | 189.8
Thr | 9.4 | 1.06 | 0.079 | 4.1
Lab | 36.04 | - | 1 | 28.3
FE | 23.13 | - | 0.2 | 10.8
Lab + FE | 59.17 | 15.3* | 1.2 | 39.1
Total | 319.81 | 23.55 | 3.279 | 233

(*) The tasks CS and HG, and Lab and FE, can be executed jointly on the extended LX2 to maximize data reuse and minimize access to external memory.

TABLE V. SPEEDUP FACTORS OF THE PROCESSING TASKS COMPARED TO THE LX2 BASE CORE

Task | ASIP + ext. | HW acc. | DSP
CS | - | 242.8 | 1.29
HG | - | 8.44 | 3.8
CS + HG | 34.94* | 125.62 | 1.32
Thr | 8.87 | 118.99 | 2.29
Lab | - | 36.04 | 1.27
FE | - | 115.65 | 2.14
Lab + FE | 3.87* | 49.3 | 1.5
Total | 13.58 | 97.5 | 1.37

(*) The tasks CS and HG, and Lab and FE, can be executed jointly on the extended LX2 to maximize data reuse and minimize access to external memory.

Fig. 6. Exemplary system configurations (A) to (D): the LX2 base core alone, the LX2 with all extensions, the LX2 base core with all PEs, and a heterogeneous combination of a partly extended LX2 with selected PEs
TABLE VI
FPGA RESOURCES AND SYSTEM FREQUENCY FOR EXEMPLARY SYSTEM CONFIGURATIONS

Configuration   FPGA Resources [LUT]   System Frequency @ 25 fps [MHz]
(A)             21,320                 2,370.7
(B)             49,821                 217.4
(C)             28,969                 7.7
(D)             38,570                 85.4

The clock cycle counts for the LX2 base core and the extended ASIP were generated by a cycle-accurate ISS provided by Tensilica and include memory latencies and cache misses. For the DSP, the cycle count of each algorithm was estimated by simulation with Code Composer Studio 3.3 and also includes memory latencies. Equivalent to the proposed system, the comparison is done at VGA resolution. The results show that the PEs significantly outperform all evaluated approaches in terms of throughput: speedups between 8.44 and 242.8 (the latter for the CS task) are reached. In comparison, utilizing the proposed SFUs, speedups between 3.87 and 34.94 are possible. The DSP outperforms the ASIP implementation without TIE extensions with a speedup of 1.27-3.8; on average, however, the TIE instruction set extensions are 10 times faster than the DSP implementations. Due to their high degree of optimization, the PEs outperform the other approaches significantly.

C. Discussion

Using the results presented above, it can now be discussed which mapping should be used for the presented case study. Therefore, four exemplary system configurations are considered (Fig. 6). As input image size, VGA resolution (640x480 pixels) has been chosen. The target throughput is arbitrarily set to 25 fps. Configuration (A) is the reference system, consisting solely of the LX2 base processor. Configuration (B) is the extended LX2 with all SFUs, with the application implemented entirely in SW. Configuration (C) is used if all computational operations are mapped to HW accelerators.
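The frequency column of Table VI can be approximated to first order as cycles per pixel times pixel rate. A sketch in Python; the frame-pipelining model and the 1 cycle/pixel figure for the slowest PE are assumptions inferred from the text, and the table's exact values additionally reflect implementation overheads:

```python
# First-order estimate of the system clock needed to sustain the target
# throughput: frequency = cycles/pixel x pixel rate (VGA at 25 fps).
WIDTH, HEIGHT, FPS = 640, 480, 25
PIXEL_RATE = WIDTH * HEIGHT * FPS  # 7.68 Mpixel/s

def required_mhz(cycles_per_pixel):
    """Clock frequency in MHz needed to process every pixel in time."""
    return cycles_per_pixel * PIXEL_RATE / 1e6

# Configuration (A): all tasks in SW on the LX2 base core (319.81 cy/px),
# on the order of the 2,370.7 MHz listed in Table VI.
print(f"(A): {required_mhz(319.81):.0f} MHz")
# Configuration (C): the slowest PE needs roughly 1 cycle per pixel,
# matching the ~7.7 MHz of Table VI.
print(f"(C): {required_mhz(1.0):.2f} MHz")
```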
Hence, the processor is solely used for control tasks, and in this configuration only the LX2 base core is needed. Configuration (D) is a heterogeneous architecture in which the application is partly mapped onto HW accelerators and partly implemented in SW; here, not all SFUs need to be included in the extended LX2 core. Table VI gives an estimate of the FPGA resources and the system frequency needed to deliver the target throughput of 25 VGA images per second. The processing times per frame result from pipelined processing of the incoming image; the duration of the pipeline stages is determined by the maximum computation time per frame of the PEs and the ASIP. To compare the FPGA resource requirements, only LUTs are used here as a straightforward metric. It is obvious that system (A) cannot be used for the given application, especially not in an FPGA implementation. Using only the ASIP for signal processing (B) is also not applicable for an FPGA implementation, since the processor cannot be synthesized for the required frequency. System (C) has the lowest system frequency; however, it is also the most application-specific HW architecture and has the lowest flexibility. For a modest increase in FPGA resource requirements, system (D) offers a low system frequency while retaining a high degree of flexibility.

VI. CONCLUSIONS

In this paper, a configurable heterogeneous hardware architecture for video-based driver assistance systems is presented. The HW architecture consists of a configurable and extensible processor with attached HW accelerators and has been implemented on an FPGA. As an example application, traffic sign recognition has been used. The different tasks of the application have been implemented as dedicated HW accelerators and as SW on the processor using novel special functional units and instructions. Therefore, the processor is used not only for control but also for data processing tasks.
For estimation of the FPGA resource requirements, different configurations of the HW architecture were synthesized for a Virtex-5 FPGA. As performance metric, the normalized execution times of the different tasks have been evaluated. For the example application discussed here, several implementation variants have been quantitatively compared. The proposed architecture provides several design options and therefore a high degree of flexibility. In early stages of the HW/SW design, the ASIP can be used to implement the reference code. In order to meet the performance constraints, novel special functional units and/or attached hardware accelerators can be added to the architecture. To guarantee a high degree of flexibility in terms of SW programmability, the usage of instruction set extensions should be preferred as long as the target throughput for the application scenario can be met.

ACKNOWLEDGMENT

The work has been funded in parts by the German Federal Ministry of Education and Research (BMBF), No. 13N10718. The authors thank Norman Nolte and Sebastian Flügel at ProDesign Electronic GmbH and Andreas Tarnowsky for their contributions to this work.