RoadNet-RT: High Throughput CNN Architecture and SoC Design for Real-Time Road Segmentation

Lin Bai, Graduate Student Member, IEEE, Yecheng Lyu, Graduate Student Member, IEEE, and Xinming Huang, Senior Member, IEEE

Abstract—In recent years, convolutional neural networks (CNNs) have gained popularity in many engineering applications, especially computer vision. To achieve better performance, more complex structures and advanced operations are incorporated into neural networks, which results in very long inference times. For time-critical tasks such as autonomous driving and virtual reality, real-time processing is fundamental. To reach real-time processing speed, a lightweight, high-throughput CNN architecture named RoadNet-RT is proposed for road segmentation in this article. It achieves a 92.55% MaxF score on the KITTI road segmentation dataset. The inference time is about 9 ms per frame when running on a GTX 1080 GPU. Compared with the state-of-the-art network, RoadNet-RT speeds up inference by a factor of 17.8 at the cost of only a 3.75% loss in accuracy. Moreover, its accuracy on the CamVid dataset is 92.98%. Several techniques, such as depthwise separable convolution and non-uniform kernel size convolution, are optimized in the hardware accelerator design. The proposed CNN architecture has been successfully implemented on a ZCU102 MPSoC FPGA and achieves a computation capability of 331 GOPS using INT8 quantization. The system throughput reaches 196.7 frames per second with an input image size of 280 × 960. The source code is published at https://github.com/linbaiwpi/RoadNet-RT.

Index Terms—Road segmentation, real-time, FPGA, neural network.

I. INTRODUCTION

NOWADAYS autonomous vehicles have become one of the most promising technologies. Owing to the continuous development of Convolutional Neural Networks (CNNs), much recent research has focused on improving the accuracy of the perception system for autonomous vehicles, such as vehicle or pedestrian detection [1], [2], depth completion [3], road segmentation [4], [5] and object tracking [6]. However, most of these neural networks are very deep, with a huge number of parameters. Even on a state-of-the-art GPU, few of them are able to process sensor data in real time. This prevents them from being applied to time-critical tasks such as autonomous driving. Therefore, a fast, lightweight CNN with reasonable accuracy is valuable for such time-critical applications.

Road segmentation is one of the fundamental perception tasks for autonomous driving: it tells the vehicle where the drivable region is. This task has been well studied by many researchers with respect to the accuracy measured by benchmarks. However, although it is a time-critical task, only 3 of the existing methods are able to run in real time, as illustrated in Fig. 1, where the red line indicates the real-time processing speed of 30 frames per second (fps), and none of them exceeds 40 fps. As a fundamental task prior to path planning and dynamic control, road segmentation is expected to process input images at a much higher frame rate, so that the real-time response of the autonomous driving system is guaranteed. Thus, there is an urgent need for real-time road segmentation that can process each image within a very short time while maintaining good accuracy, bridging the gap between academic research and industry practice.

In this article, we propose RoadNet-RT, a real-time road segmentation network that is able to run in real time on a GPU. In addition, we summarize some hardware optimization techniques aimed at converting ordinary CNN structures into hardware-friendly ones. As a demonstration, RoadNet-RT has been successfully implemented on an FPGA by applying these techniques, resulting in real-time processing on hardware. The contributions of this article are summarized as follows:

- A lightweight, high-throughput CNN named RoadNet-RT is proposed, whose segmentation accuracy is 92.55% on the KITTI road segmentation leaderboard. RoadNet-RT extracts features from two branches: one shallow branch for spatial information and one deep branch for context information. Its inference time on an NVIDIA GTX 1080 is about 9 ms. Compared with the state-of-the-art RBA-Net [4], this network reduces the inference time by 94.4%, with only a 3.75% loss in accuracy.
- To provide general guidelines on how to transform a segmentation CNN into a hardware-friendly one with both computation and bandwidth efficiency, we investigate several hardware optimization techniques through a series of experiments with quantitative results. For instance, we study how to employ depthwise separable convolution, how to deal with convolutions of different kernel sizes and with dilated convolutions, and whether to use batch normalization.
- An efficient hardware accelerator has been implemented on a ZCU102 MPSoC FPGA platform. By balancing the bandwidth and computation capability, this accelerator can process 196.7 image frames per second with INT8 quantization, equivalent to the efficiency of 331 Giga Operations Per Second (GOPS).

The rest of the paper is organized as follows: Sec. II summarizes the existing research on road segmentation, real-time segmentation CNNs and the FPGA implementations of segmentation networks. In Sec. III, the proposed segmentation network model is described together with its training details. An in-depth study of network optimization techniques for hardware efficiency and accuracy is presented in Sec. IV. The FPGA implementation and its results are discussed in Sec. V and Sec. VI, respectively. Sec. VII concludes the paper.

II. RELATED WORK

A. Road Segmentation

Many research efforts have been devoted to the road segmentation task in KITTI. The RBA-Net proposed in [4] adopted the classical encoder-decoder structure. Instead of using the direct skip connection of U-Net [7] and SegNet [8], a residual refinement module, which consisted of reverse attention and boundary attention mechanisms, bridged the encoder and decoder parts so that high-resolution spatial details were preserved for decoding. An Atrous Spatial Pyramid Pooling (ASPP) module was also utilized in RBA-Net. For images of size 360 × 500, this network could process each frame within 83 ms on a TITAN X GPU.

Other CNN based road segmentation algorithms such as DEEP-DIG [13] and MAP [14] generated a precise drivable region but required heavy computational power.

In our previous work RoadNetV3 [15], we introduced Long Short-Term Memory (LSTM) to help find the contour of the road. It extracted features via an FCN-like encoder, followed by several convolutional LSTM layers that predicted the contours of the drivable region. It achieved 93.08% accuracy but required 300 ms per frame.

B. Real-Time Segmentation

In recent years, some researchers have shifted their focus to real-time segmentation tasks. Their solutions generally fall into two groups (Fig. 2): one is the encoder-decoder network and the other is the bilateral network.

FPENet [16] adopted the encoder-decoder structure. By using a feature pyramid encoding block to encode multi-scale contextual features with depthwise dilated convolutions in all stages and a mutual embedding upsample module as the decoder, FPENet efficiently aggregated high-level semantic features and low-level spatial details. By introducing an efficient spatial pyramid (ESP), ESPNet [17] brought great improvement in both speed and performance. Its improved version, ESPNet-V2 [18], further enlarged the receptive field and reduced the amount of computation and the number of parameters. In [19], DABNet balanced efficiency and accuracy by stacking lightweight blocks with different dilation rates. DFANet [20] aggregated multi-scale features from different layers to gain higher accuracy in spatial details. The lightweight backbone of DFANet guaranteed its real-time processing speed.

ContextNet [21] proposed the bilateral structure for the first time. A deep but low-resolution network extracted the context information, while a shallow but high-resolution network focused on detailed spatial information. BiSeNet [22] inherited the solution of ContextNet and improved the feature fusion by creating an attention refinement module and a feature fusion module. By adding a global pooling layer and a residual layer, BiSeNet outperformed ContextNet. In ICNet [23], the authors borrowed the image pyramid idea from PSPNet [24]. One more branch was added to acquire more spatial details. Together with label-guided training for each branch, this gave ICNet better accuracy than BiSeNet but a longer processing time. BiSeNet-V2 [25] improved the first version by replacing the feature fusion module with an aggregation module and using a Seg Head to guide the loss of each feature extractor layer. Other networks such as LBN-AA [26] and CANet [27] also used a similar structure.

Solutions other than the two mentioned above also present good results. FarSee-Net [28] applied Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP) at the end of the feature extraction layers to guarantee that enough spatial information was captured. Moreover, to reduce the number of operations, sub-pixel convolution was deployed, so that FarSee-Net accepted low-resolution input and generated high-resolution output.

C. FPGA Implementation of Segmentation

To accelerate the inference speed, a great amount of effort has focused on FPGA implementations of segmentation neural networks. The key to a hardware accelerator for CNNs is the trade-off between bandwidth and computation capability. U-Net [7] and FCN [29] were both implemented in [30]. By utilizing a convolution plus board removing method, this accelerator performed transposed convolution efficiently. Its performance was 107 GOPS and it supported up to 17 fps for 512 × 512 images. A straightforward fully convolutional neural network for segmentation was proposed and implemented on FPGA in [31], [32]. Without changing the channel depth for each layer and without the skip connections used in U-Net [7], this accelerator reached 79.4 fps for an input size of 64 × 180 × 14. Liu merged the convolution and transposed convolution into one vector multiplication unit and fused all intermediate feature maps in on-chip memory [33]. The FPGA implementation reached 1578 GOPS, corresponding to 57 fps for 256 × 256 × 3 images. Another hardware architecture combining the convolution and transposed convolution operations was proposed in [34]. Its computation capability was 151.5 GOPS and 94.3 GOPS for convolution and transposed convolution respectively. Besides, a 3D segmentation CNN accelerator was implemented in [35].

III. PROPOSED NETWORK

The proposed road segmentation network is inspired by ContextNet [21], BiSeNet [22] and ICNet [23]. It consists of two branches for context information and spatial information extraction respectively, as shown in Fig. 3.

The context branch is a deep network for extracting the context information, which consists of an input convolutional layer and two residual modules from ResNet18 [36]. Subsequently, the extracted features are fed to the ASPP module in order to concatenate the features from different receptive fields (dilation rates of 2, 4, 8 and 16, each with a depth of 32, see Fig. 4). Next, a Global Attention Module (GAM) is introduced to refine the context information. The GAM (Fig. 5a) is modified from the Attention Refinement Module in [22]. It consists of a global average pooling layer together with a 1 × 1 convolutional layer that extracts the global context feature. These refined global features are applied to the context features via multiplication. The sigmoid layer decides whether to apply the global features or not. Since the context path does not have to focus on spatial details, we shrink the input image size by half in both width and height, as a step to further reduce the computation.

The spatial path, which focuses on the spatial details of the input images, contains only four convolutional layers. To enhance its capability of capturing details, no image resizing is applied here. The context and spatial branches are fused in a residual refinement manner by the Feature Fusion Module (FFM) [22] (Fig. 5b). The residual of the FFM is the product of the input feature map and its global attention path, which includes a global average pooling layer, a $1 \times 1$ convolutional layer and activation layers (ReLU and Sigmoid). At the end of the network, to reproduce an output with the same size as the input, the output of the FFM is upsampled 8 times by the bilinear resize algorithm.
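
The data flow of the GAM described above can be illustrated with a minimal pure-Python sketch. It assumes the order global average pooling, 1 × 1 convolution, sigmoid gate, channel-wise multiplication; the function name `global_attention` and the weight values are hypothetical, and details of the actual module (such as normalization layers) are not modeled.

```python
import math

def global_attention(feature_map, weights):
    """Apply a GAM-style channel gate.

    feature_map: list of C channels, each a list of H rows of W floats.
    weights: C x C matrix of the 1x1 convolution (hypothetical values).
    """
    channels = len(feature_map)
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_map]
    # A 1x1 convolution on a 1x1 map is a matrix-vector product.
    mixed = [sum(weights[o][i] * pooled[i] for i in range(channels))
             for o in range(channels)]
    # The sigmoid turns each channel response into a gate in (0, 1).
    gate = [1.0 / (1.0 + math.exp(-v)) for v in mixed]
    # The gate rescales every pixel of its channel.
    return [[[gate[c] * v for v in row] for row in feature_map[c]]
            for c in range(channels)]
```

With all-zero weights every gate equals 0.5, so the output is simply half of the input, which is a convenient sanity check.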

The number of channels is chosen to be a multiple of 32. This choice is based on the degree of parallelism that the hardware accelerator supports, in order to maximize its efficiency.

### A. Training Details

This road segmentation network is implemented using Keras and trained from scratch on a single GeForce GTX 1080 GPU. All the convolutional layers are initialized using the Xavier uniform initializer [37]. During training, the batch size is set to 24. The Adam optimizer is used with a learning rate of $10^{-3}$. When the training reaches a plateau, a reduction factor of 0.8 is applied to the learning rate. A hybrid loss function combining Dice loss and Focal loss is used to balance the positive and negative samples.
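
As an illustration of such a hybrid loss, the sketch below combines a Dice term and a Focal term on flattened per-pixel labels and probabilities. The text does not state the weighting of the two terms or the Focal parameters, so `dice_weight`, `focal_weight`, `gamma`, `alpha` and `smooth` are assumptions chosen for illustration, not the values used for RoadNet-RT.

```python
import math

def dice_loss(y_true, y_pred, smooth=1.0):
    """Dice loss on flat lists of labels (0/1) and probabilities."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss averaged over all pixels."""
    loss = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if t == 1 else 1.0 - p           # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha   # class balancing weight
        loss += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return loss / len(y_true)

def hybrid_loss(y_true, y_pred, dice_weight=1.0, focal_weight=1.0):
    """Weighted sum of Dice and Focal loss (weights are assumptions)."""
    return (dice_weight * dice_loss(y_true, y_pred)
            + focal_weight * focal_loss(y_true, y_pred))
```

The Dice term is driven by region overlap, while the Focal term down-weights pixels that are already classified confidently, which is why the combination is a common choice for imbalanced foreground and background.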

Data augmentation for training includes random horizontal flip, Gaussian noise adding, random brightness contrast, random blurring, etc.

### B. Dataset and Evaluation

1) **KITTI**: The dataset for training and evaluation is the KITTI road segmentation dataset, which contains 289 training images and 290 testing images. The training image size ranges from $370 \times 1224$ to $375 \times 1242$. The evaluation job is done by an online evaluation server supplied by KITTI. The evaluation (Tab. III) is divided into Urban Unmarked (UU), Urban Marked (UM) and Urban Multiple Marked lanes (UMM). URBAN_ROAD is the comprehensive evaluation of the above three.

When running on a GeForce GTX 1080 GPU, this network can process each image of $280 \times 960$ pixels in 9 ms. Four prediction samples are shown in front view and bird's-eye view in Fig. 6 and Fig. 7 respectively, where the green area represents the overlap between prediction and ground truth, the red area is road in the ground truth that is not correctly predicted by our network, and the blue area is non-road that is recognized as road by our network.

Tab. I shows the performance comparison between RoadNet-RT and other state-of-the-art networks. The FNR (False Negative Rate) is the ratio of road pixels that are wrongly recognized as non-road, while the FPR (False Positive Rate) is the ratio of non-road pixels that are wrongly classified as road. From Tab. I, we can see that RoadNet-RT has a much higher FNR (7.84%) than its peers. For a moving autonomous vehicle, a high FNR poses more restrictions on the drivable region. In contrast, a high FPR means that the neural network classifies more non-road pixels as road. For example, the vehicle may recognize other cars on the roadside or bushes as drivable region. Thus, a high FPR would cause a safety issue. In the FPR column of Tab. I, RoadNet-RT's FPR is comparable to that of its peers. Therefore, we consider RoadNet-RT to be as safe as the other state-of-the-art networks listed in Tab. I.
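
The definitions above translate directly into pixel counts. The following sketch computes them from a confusion matrix; the function name `road_metrics` and the example counts are hypothetical. Note that FNR and recall always sum to one, which agrees with the REC and FNR columns reported for RoadNet-RT.

```python
def road_metrics(tp, fp, tn, fn):
    """Pixel-level metrics for a binary road / non-road segmentation.

    tp: road pixels predicted as road
    fp: non-road pixels predicted as road
    tn: non-road pixels predicted as non-road
    fn: road pixels predicted as non-road
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),   # non-road wrongly classified as road
        "fnr": fn / (fn + tp),   # road wrongly classified as non-road
    }

# Hypothetical counts: 100 road pixels and 100 non-road pixels.
example = road_metrics(tp=90, fp=5, tn=95, fn=10)
```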

### Table I

**KITTI Evaluation Comparison on URBAN_ROAD Benchmark.**

<table>
<thead>
<tr>
<th>Method</th>
<th>MaxF</th>
<th>AP</th>
<th>FPR</th>
<th>FNR</th>
</tr>
</thead>
<tbody>
<tr>
<td>RBANet [4]</td>
<td>96.30%</td>
<td>89.72%</td>
<td>2.75%</td>
<td>2.50%</td>
</tr>
<tr>
<td>SSLGAN [9]</td>
<td>95.53%</td>
<td>90.35%</td>
<td>2.28%</td>
<td>4.76%</td>
</tr>
<tr>
<td>RBNNet [5]</td>
<td>94.97%</td>
<td>91.49%</td>
<td>2.79%</td>
<td>4.99%</td>
</tr>
<tr>
<td>SixelNet-II [10]</td>
<td>94.88%</td>
<td>87.75%</td>
<td>4.04%</td>
<td>3.13%</td>
</tr>
<tr>
<td>RoadNet-RT</td>
<td>92.55%</td>
<td>93.21%</td>
<td>3.86%</td>
<td>7.84%</td>
</tr>
</tbody>
</table>

### Table II

**Road Segmentation Results on the CamVid Test Dataset.**

<table>
<thead>
<tr>
<th>Methods</th>
<th>F1-measure</th>
<th>Precision</th>
<th>Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td>RBANet [4]</td>
<td>96.73%</td>
<td>97.14%</td>
<td>96.30%</td>
</tr>
<tr>
<td>RoadNet-RT</td>
<td>92.98%</td>
<td>94.76%</td>
<td>91.91%</td>
</tr>
</tbody>
</table>

As shown in Fig. 8, compared with RBANet, most of the classification errors of RoadNet-RT occur near the boundary of the road, since we choose not to include boundary attention in the model owing to its computational complexity. These errors do not affect autonomous driving because the path planning algorithm does not consider the boundary of the drivable area.

2) CamVid: Besides the well-known KITTI dataset, RoadNet-RT has also been evaluated on the CamVid dataset to verify its effectiveness on various road scenes (Tab. II). In terms of F1 score, RoadNet-RT achieves 92.98% on the CamVid test dataset, which is 3.75% lower than the SOTA network RBANet [4]. Since the processing time of RBANet on CamVid is not provided in [4], we skip the processing speed comparison on CamVid.

IV. NETWORK OPTIMIZATION FOR HARDWARE

In this section, we summarize some guidelines for optimizing specific CNNs toward FPGA accelerator implementation, so that the on-chip resource efficiency and the computation efficiency of the FPGA design are maximized. Different from conventional optimization techniques, the goal of this step is to balance the number of operations, the number of weights and the computation patterns, while keeping the accuracy within a reasonable range.

A. Depthwise Separable Convolution

Depthwise separable convolution was initially introduced in [38]. It has been widely adopted by a number of lightweight neural networks such as Xception [39] and the MobileNet series [40], [41]. The main idea of depthwise separable convolution is to decompose a standard convolution into a $3 \times 3$ depthwise convolution and a $1 \times 1$ pointwise convolution to achieve a smaller number of weights and consequently fewer operations. Let $D_K$ be the size of the convolution kernel, $M$ the depth of the input feature maps and $N$ the number of convolution kernels (also the number of channels of the output feature maps).

During depthwise convolution, a single filter is applied to each input channel. And then the pointwise convolution applies a $1 \times 1$ convolution to combine the outputs of the depthwise convolution. The number of weights required by standard convolution and depthwise separable convolution are calculated in (1) and (2) respectively.

\[
D_K \cdot D_K \cdot M \cdot N \quad (1)
\]
\[
D_K \cdot D_K \cdot M + M \cdot N \quad (2)
\]

Therefore, when replacing standard convolution with depthwise separable convolution, the reduction ratio of weights is

\[
\frac{D_K \cdot D_K \cdot M + M \cdot N}{D_K \cdot D_K \cdot M \cdot N} = \frac{1}{N} + \frac{1}{D_K^2} \quad (3)
\]
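
A quick numerical check of (1) to (3) is sketched below for a single layer. The layer shape ($D_K = 3$, $M = N = 32$) is only an illustrative choice, picked because the network uses channel depths that are multiples of 32; the function names are hypothetical.

```python
def standard_conv_weights(d_k, m, n):
    """Number of weights of a standard convolution, Eq. (1)."""
    return d_k * d_k * m * n

def separable_conv_weights(d_k, m, n):
    """Number of weights of a depthwise separable convolution, Eq. (2)."""
    return d_k * d_k * m + m * n

def reduction_ratio(d_k, m, n):
    """Ratio of separable to standard weights, Eq. (3)."""
    return separable_conv_weights(d_k, m, n) / standard_conv_weights(d_k, m, n)

# Illustrative layer: 3x3 kernel, 32 input channels, 32 output channels.
standard = standard_conv_weights(3, 32, 32)     # 9 * 32 * 32
separable = separable_conv_weights(3, 32, 32)   # 9 * 32 + 32 * 32
ratio = reduction_ratio(3, 32, 32)              # equals 1/32 + 1/9
```

For this layer the separable version keeps roughly 14% of the weights, and the ratio approaches $1/D_K^2$ as $N$ grows.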

Besides reducing the number of parameters and operations, from the hardware implementation point of view, depthwise separable convolution does not need as large an accumulator as standard convolution does. In standard convolution, every element of the output feature map is the sum of $D_K \cdot D_K \cdot M$ elements, whereas in depthwise separable convolution it is the sum of $D_K \cdot D_K$ and $M$ elements for the depthwise convolution and the pointwise convolution respectively. On the other hand, separating a standard convolution into a depthwise convolution and a pointwise convolution requires intermediate feature map buffering, and hence demands larger bandwidth.

Applying this to RoadNet-RT proposed in this article, the total number of parameters is reduced from 756K to 134K, which is illustrated in Tab. IV. Although the accuracy loss is 1.37%, the number of parameters reduces by a factor of 5.64.

B. Large Kernel Size Convolution

The most commonly used kernel size for convolution is $3 \times 3$. However, in order to obtain a large receptive field, especially in the first layer, a large kernel size is usually desired ($7 \times 7$ in ResNet [36], for instance).

Algorithm 1 Cascaded Loop of Standard Convolution

```
for no in Nof do  // output channel, loop-4
    for (y,x) in (Noy,Nox) do  // feature map, loop-3
        for ni in Nif do  // input channel, loop-2
            for (ky,kx) in (K,K) do  // kernel, loop-1
                Fout[no,y,x] += Fin[ni,y+ky,x+kx] * K[no,ni,ky,kx]
        Fout[no,y,x] += bias[no]  // added once per output element
```
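
For readers who want to execute the loop nest, here is a minimal pure-Python version of the same four loops. It assumes stride 1 and no padding, which the pseudocode does not specify; the function name `conv2d_loops` is hypothetical.

```python
def conv2d_loops(fin, kernel, bias):
    """Standard convolution as a cascaded loop nest (stride 1, no padding).

    fin[ni][y][x]:          input feature maps
    kernel[no][ni][ky][kx]: convolution weights
    bias[no]:               one bias per output channel
    """
    n_of, n_if = len(kernel), len(fin)
    k = len(kernel[0][0])
    n_oy = len(fin[0]) - k + 1
    n_ox = len(fin[0][0]) - k + 1
    fout = [[[0.0] * n_ox for _ in range(n_oy)] for _ in range(n_of)]
    for no in range(n_of):                  # loop-4: output channels
        for y in range(n_oy):               # loop-3: output feature map
            for x in range(n_ox):
                acc = 0.0
                for ni in range(n_if):      # loop-2: input channels
                    for ky in range(k):     # loop-1: kernel window
                        for kx in range(k):
                            acc += fin[ni][y + ky][x + kx] * kernel[no][ni][ky][kx]
                fout[no][y][x] = acc + bias[no]   # bias added once per element
    return fout
```

Unrolling loop-1 in hardware corresponds to computing the inner `ky`, `kx` products in parallel, while unrolling loop-2 corresponds to processing several input channels at once.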

However, supporting filters of different kernel sizes affects either the processing parallelism or the efficiency of buffer usage. From the matrix multiplication point of view (Alg. 1), by keeping loop-1 as a sequential loop, the hardware accelerator can handle filters of different sizes without consuming extra multipliers, but the penalty is the loss of parallelism in loop-1. Moreover, filters of different sizes require different amounts of on-chip memory. Consider a feature map of size $W \cdot H \cdot C$: to buffer it for a $K \cdot K$ filter, a memory of size $(W+K-1) \cdot (H+K-1) \cdot C$ is needed. Consequently, the feature map buffer for a \(7 \times 7\) filter is roughly \(4 \cdot (W + H + 4)/(W \cdot H)\) times larger than that for a \(3 \times 3\) filter.

### Table III

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>MaxF</th>
<th>AP</th>
<th>PRE</th>
<th>REC</th>
<th>FPR</th>
<th>FNR</th>
</tr>
</thead>
<tbody>
<tr>
<td>UM_ROAD</td>
<td>91.9%</td>
<td>92.54%</td>
<td>92.75%</td>
<td>91.24%</td>
<td>3.25%</td>
<td>8.76%</td>
</tr>
<tr>
<td>UMM_ROAD</td>
<td>93.98%</td>
<td>95.19%</td>
<td>94.47%</td>
<td>93.49%</td>
<td>6.01%</td>
<td>6.51%</td>
</tr>
<tr>
<td>UU_ROAD</td>
<td>90.79%</td>
<td>91.67%</td>
<td>91.79%</td>
<td>89.80%</td>
<td>2.62%</td>
<td>10.20%</td>
</tr>
<tr>
<td>URBAN_ROAD</td>
<td>92.55%</td>
<td>93.21%</td>
<td>92.94%</td>
<td>92.16%</td>
<td>3.86%</td>
<td>7.84%</td>
</tr>
</tbody>
</table>

### Table IV

<table>
<thead>
<tr>
<th>Convolution type</th>
<th>IOU</th>
<th>Parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td>Standard</td>
<td>93.67%</td>
<td>756,032</td>
</tr>
<tr>
<td>Depthwise separable</td>
<td>92.30%</td>
<td>133,870</td>
</tr>
</tbody>
</table>

¹ Since the KITTI online test server limits submissions to 3 per month, 20% of the training set has been split off as a validation set to evaluate the methods we propose. Here we choose IOU as the main metric to estimate the performance of different methods. IOU is one of the most important and most widely used metrics for segmentation performance evaluation.

To obtain the same receptive field as a \(7 \times 7\) kernel, three cascaded convolutional layers with kernel size \(3 \times 3\) can replace one convolutional layer with kernel size \(7 \times 7\). In that case, no extra resources, neither multipliers nor memory, are needed. Besides, the number of operations decreases. As illustrated in Fig. 9, for an input feature map of size \(W \cdot H \cdot C_i\) and an output feature map of size \(W \cdot H \cdot C_o\), a \(7 \times 7\) filter costs \(W \cdot H \cdot 7 \cdot 7 \cdot C_i \cdot C_o = 49 \cdot W \cdot H \cdot C_i \cdot C_o\) multiply-accumulate operations in total, whereas three \(3 \times 3\) convolutional layers cost \(3 \cdot (W \cdot H \cdot 3 \cdot 3 \cdot C_i \cdot C_o) = 27 \cdot W \cdot H \cdot C_i \cdot C_o\).
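
The operation counts and the buffer formula $(W+K-1) \cdot (H+K-1) \cdot C$ used in this subsection can be evaluated with a short sketch. The feature map size and channel depth below are illustrative assumptions, and the three cascaded layers are assumed to share the same input and output depth, as in the expression above.

```python
def conv_macs(w, h, k, c_in, c_out):
    """Multiply-accumulate operations of a k x k convolution on a w x h map."""
    return w * h * k * k * c_in * c_out

def buffer_elements(w, h, k, c):
    """Feature map buffer size for a k x k filter: (w+k-1) * (h+k-1) * c."""
    return (w + k - 1) * (h + k - 1) * c

# Illustrative shape: 960 x 280 feature map with 32 input and output channels.
W, H, C = 960, 280, 32
macs_7x7 = conv_macs(W, H, 7, C, C)
macs_three_3x3 = 3 * conv_macs(W, H, 3, C, C)
saving = 1.0 - macs_three_3x3 / macs_7x7        # 1 - 27/49, about 45%
extra_buffer = buffer_elements(W, H, 7, C) - buffer_elements(W, H, 3, C)
```

The ratio 27/49 is independent of the feature map size, so the relative saving in operations is the same for every layer where this replacement is applied.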

The performance comparison between the two options mentioned above is shown in Tab. V. When the first convolutional layer \((7 \times 7)\) is replaced with three \(3 \times 3\) convolutional layers, the accuracy loss in IOU is 0.19\%. Since there is only one \(7 \times 7\) convolutional layer, the savings in operations and parameters are negligible.

In segmentation networks, dilated convolution [42] is the most widely used method to enlarge the receptive field without introducing more weights. Unfortunately, during convolution with a dilated kernel (\(3 \times 3\) with a dilation rate of 3, for instance), the region required from the feature map is still \(7 \times 7\), which raises the same dilemma described above. The only difference is that, if three \(3 \times 3\) convolutional layers are used instead of one dilated \(3 \times 3\) convolutional layer with a dilation rate of 3, two times more weights and two times more operations are unavoidable. However, since dilated convolutional layers usually do not dominate the network, this penalty is still affordable.
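
The footprint of a dilated kernel follows the standard relation $k_{eff} = k + (k-1)(d-1)$, which the sketch below evaluates for the example in the text. The function name is hypothetical, and the weight comparison counts kernel weights per input/output channel pair only.

```python
def effective_kernel_size(k, dilation):
    """Side length of the region covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)

# A 3x3 kernel with dilation rate 3 covers a 7x7 region of the feature map.
footprint = effective_kernel_size(3, 3)

# Weights per channel pair: one dilated 3x3 layer versus three plain 3x3 layers.
dilated_weights = 3 * 3
cascaded_weights = 3 * (3 * 3)
extra_factor = (cascaded_weights - dilated_weights) / dilated_weights
```

Here `extra_factor` evaluates to 2, matching the statement that the cascaded replacement needs two times more weights than the dilated layer.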

### C. Consideration of Channel Depth

In our hardware implementation, after considering the resources available on the ZCU102 board, loop-2 in Alg. 1 has been unrolled so that 32 feature maps are processed in parallel. To maximize the computation efficiency of the accelerator, the input feature map depth of all layers should be an integer multiple of 32.

### D. Batch Normalization

During inference, Batch Normalization (BN) reduces to a \(1 \times 1\) convolution and can be further merged into the convolutional layer preceding it. The merged weights and bias follow (4) and (5), where \(W\) and \(b\) represent weights and bias respectively.

\[
W_{merge} = W_{BN} \cdot W_{conv} \quad (4)
\]
\[
b_{merge} = W_{BN} \cdot b_{conv} + b_{BN} \quad (5)
\]
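
The merge can be checked numerically for a single channel. The sketch below assumes the usual inference-time form of BN, in which $W_{BN} = \gamma / \sqrt{\sigma^2 + \epsilon}$ and $b_{BN} = \beta - W_{BN} \cdot \mu$; these definitions, the function name and the parameter values are not given in the text and are stated here as assumptions.

```python
import math

def fold_batch_norm(w_conv, b_conv, gamma, beta, mean, var, eps=1e-3):
    """Merge per-channel BN parameters into the preceding convolution."""
    w_bn = gamma / math.sqrt(var + eps)   # BN scale (assumed definition)
    b_bn = beta - w_bn * mean             # BN shift (assumed definition)
    w_merge = w_bn * w_conv               # Eq. (4)
    b_merge = w_bn * b_conv + b_bn        # Eq. (5)
    return w_merge, b_merge

# Hypothetical single-channel parameters.
w_conv, b_conv = 0.8, 0.1
gamma, beta, mean, var, eps = 1.5, -0.2, 0.3, 0.25, 1e-3
w_merge, b_merge = fold_batch_norm(w_conv, b_conv, gamma, beta, mean, var, eps)

x = 2.0
separate = gamma * ((w_conv * x + b_conv) - mean) / math.sqrt(var + eps) + beta
merged = w_merge * x + b_merge
```

Both `separate` and `merged` give the same value up to floating-point rounding, so no BN hardware is needed when BN directly follows the convolution.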

The batch normalization layer is helpful for fast convergence but is not always necessary with respect to accuracy (PointNet [43], for instance). The contribution of the BN layer is evaluated in Tab. VII, from which we find that, in our segmentation neural network, BN increases the accuracy by 0.28\% without much difference in convergence. Therefore, BN layers are kept in RoadNet-RT.

Some experiments suggest that placing BN after ReLU usually gives better results [44], but this may vary from one network to another.

### E. Quantization

To maximize the computation capability of the FPGA, fixed-point operations are preferred. Quantization-aware training has been performed for 8-bit and 16-bit precision respectively, with the help of the QKeras model optimization library [45]. Brute-force quantization may lead to unacceptable precision loss, whereas quantization-aware training restricts the bit-width during training. This not only compensates for the precision loss but also introduces more non-linearity.
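
To make the rounding step concrete, the sketch below shows a generic symmetric fixed-point quantizer of the kind that quantization-aware training simulates in the forward pass. It is not the QKeras implementation: the per-tensor scale derived from the maximum absolute value and the function names are assumptions made for illustration.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    q_max = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0   # assumed per-tensor scale
    q = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover the approximate float values from the integer codes."""
    return [v * scale for v in q]

# Hypothetical weights.
weights = [-1.0, -0.31, 0.0, 0.42, 1.0]
q8, scale8 = quantize_symmetric(weights, bits=8)
q16, scale16 = quantize_symmetric(weights, bits=16)
error8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, scale8)))
error16 = max(abs(w - d) for w, d in zip(weights, dequantize(q16, scale16)))
```

The rounding error is bounded by half a quantization step, so going from 16 to 8 bits makes the worst-case error about 256 times larger while halving the storage per weight.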

The performance after quantization is shown in Tab. VIII. The IoU accuracy of the 8-bit implementation is 92.36\%, while that of the 16-bit quantization is 92.40\%. The accuracy of 16-bit quantization is 0.04\% higher than that of 8-bit quantization, but it requires twice as much memory for weight storage. Here we choose 8-bit INT quantization for the hardware implementation for two reasons: 1) from the storage perspective, the memory space for 8-bit weights is only half of that for 16-bit quantization; 2) from the hardware resources perspective, each DSP48E2 slice can perform two 8-bit multiplications simultaneously but only one 16-bit multiplication [46].

TABLE VIII

PERFORMANCE OF 8-BIT AND 16-BIT QUANTIZED NETWORKS.

<table>
<thead>
<tr>
<th>Bit Width</th>
<th>IOU</th>
<th>size of parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td>float32</td>
<td>93.67%</td>
<td>2.88MB</td>
</tr>
<tr>
<td>int16</td>
<td>92.40%</td>
<td>1.44MB</td>
</tr>
<tr>
<td>int8</td>
<td>92.36%</td>
<td>0.72MB</td>
</tr>
</tbody>
</table>

TABLE IX

PERFORMANCE COMPARISON FOR DIFFERENT TECHNIQUES.

<table>
<thead>
<tr>
<th>Technique</th>
<th>IOU</th>
</tr>
</thead>
<tbody>
<tr>
<td>No optimization</td>
<td>93.67%</td>
</tr>
<tr>
<td>opt 1 - Replace 7×7 kernel</td>
<td>93.67%</td>
</tr>
<tr>
<td>opt 2 - Replace dilated convolution</td>
<td>93.62%</td>
</tr>
<tr>
<td>opt 3 - Depthwise separable convolution</td>
<td>93.07%</td>
</tr>
<tr>
<td>opt 4 - Quantization (INT8 on FPGA)</td>
<td>91.99%</td>
</tr>
</tbody>
</table>

Fig. 10. System overview of RoadNet-RT accelerator.

F. Progressive Impact of Optimization Techniques

To assess the impact on precision, all the optimization (opt) techniques described above have been applied to RoadNet-RT progressively. The corresponding changes in IOU are listed in Tab. IX. As mentioned earlier, the 7×7 kernel can be computed using 3×3 convolutions, and there is no degradation in accuracy. Next, we use 3×3 convolutions to replace the dilated convolutions with different dilation rates. There are four dilated convolutions in RoadNet-RT, and replacing them accounts for a performance drop of 0.05%. Depthwise separable convolution sharply reduces the computation complexity at the penalty of reduced network capacity, resulting in an additional precision loss of 0.55%. Finally, we apply fixed-point quantization to the model, which contributes the largest precision loss (1.08%) among all optimization techniques.

V. SYSTEM-ON-CHIP IMPLEMENTATION

To fully utilize the computation resources, the whole system is partitioned into a software part (run by the ARM processor) and a hardware part (running on the FPGA). The job of the software part is image resizing for both the input and the output of the neural network (Fig. 3). With the help of the OpenCV library [47], image resizing can be easily done on the PYNQ platform.

Combining the depthwise and pointwise convolutions into one process engine array is possible, but it may require reshaping the output feature map before sending it to DDR memory and consequently decrease the efficiency of the accelerator. Therefore, we decided to implement these two computation modules separately. The overview of the hardware architecture is shown in Fig. 10. It consists of a depthwise convolution module, a pointwise convolution module, feature map buffers and weight buffers. A finite state machine controls the running order of the CNN operations. All the modules mentioned above are configurable based on the on-chip resources available on the target FPGA platform.

We chose 32 as the depth of the process engine array for three reasons: 1) the target ZCU102 development kit supplies 2520 DSPs and 32.1 Mb of BRAM, which is sufficient for 32 process engines and the corresponding feature map buffers; 2) except for the input layer and the output layer, the depths of all the layers in RoadNet-RT are multiples of 32, so using 32 maximizes the utilization of each multiplier; 3) as the greatest common divisor of the layer depths, 32 minimizes the data transfer for convolutions whose depth is large.

A. Depthwise Convolution Module

The depthwise convolution module (Fig. 11) contains line buffers, process engines (PEs) and adder trees. As described in the previous section, to unroll the kernel loop (loop-1 in Alg. 1), a line buffer is needed to generate the sliding patch. Since the kernel size of all the convolutional layers in this segmentation network is 3×3, a multiplier array of length 9 follows the line buffer. Correspondingly, an adder tree at the end sums up the products. To balance the computation efficiency and on-chip resources, the batch size of the depthwise convolution module is set to 32.

B. Pointwise Convolution Module

To align with the depthwise convolution module and fit the same size of feature buffers, the pointwise convolution module (Fig. 12) is designed to handle the multiplication of a 32×1 vector by a 32×32 matrix. Three components, a multiplier array, an adder tree and a ReLU module, form the pointwise convolution module. If the batch normalization layer is placed before the ReLU layer, it can be merged into and completed by the multiplier array and adder tree. Otherwise, 1 extra multiplier and 1 extra adder are necessary to perform the batch normalization operation.
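
Functionally, the module described above maps one 32-channel pixel to 32 output channels. The sketch below is a behavioral model only, not the hardware description; the function name `pointwise_pe` and the weight layout (input channel first) are assumptions.

```python
def pointwise_pe(pixel, weights):
    """Behavioral model of the pointwise module for one pixel.

    pixel:   list of C_in activations (32 in the accelerator)
    weights: C_in x C_out matrix (32 x 32 in the accelerator)
    """
    c_in, c_out = len(pixel), len(weights[0])
    out = []
    for o in range(c_out):
        # Multiplier array: C_in products computed in parallel in hardware.
        products = [pixel[i] * weights[i][o] for i in range(c_in)]
        # Adder tree followed by ReLU.
        out.append(max(0, sum(products)))
    return out

# Identity weights leave the activations unchanged before ReLU clips negatives.
identity = [[1 if i == o else 0 for o in range(32)] for i in range(32)]
pixel = list(range(-16, 16))
result = pointwise_pe(pixel, identity)
```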

C. GAM Module and FFM Module

Both the GAM and FFM modules require operations with totally different computation patterns. Global average pooling calculates the average value of one entire channel. Therefore, an accumulator plus one multiplier has been implemented for each channel. The following $1 \times 1$ convolution is mathematically a vector-matrix multiplication, which can either be routed into the pointwise convolution module or be implemented with extra resources, given that the resource consumption of this operation is small. The sigmoid function is approximated by a piece-wise function and implemented using a Look-Up Table.
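
A look-up-table sigmoid can be modeled as follows. The table range, the number of entries and the piece-wise constant form are assumptions for illustration; the text does not specify the approximation actually used in the accelerator.

```python
import math

LUT_MIN, LUT_MAX, LUT_SIZE = -8.0, 8.0, 256       # assumed table parameters
STEP = (LUT_MAX - LUT_MIN) / LUT_SIZE
# Each entry stores the sigmoid evaluated at the center of its interval.
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(LUT_MIN + (i + 0.5) * STEP)))
               for i in range(LUT_SIZE)]

def sigmoid_lut(x):
    """Piece-wise constant sigmoid: saturate outside the table, look up inside."""
    if x < LUT_MIN:
        return 0.0
    if x >= LUT_MAX:
        return 1.0
    return SIGMOID_LUT[int((x - LUT_MIN) / STEP)]
```

With these parameters the approximation error stays below 0.01 over the whole input range, because the sigmoid slope never exceeds 0.25 and each table entry covers an interval of width 1/16.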

D. Buffers
The on-chip memory is divided into buffers for feature maps, weights and global pooling results, respectively. In this design, 1) there are no biases, so no extra buffer is needed for bias storage, and 2) the weights occupy only a small portion of the on-chip memory, so they can be hard-coded into it.

One effective way to boost the processing speed is to reduce the number of data transfers between the FPGA and the DDR memory. Multiple feature map buffers of size $35 \times 120 \times 32$ have been implemented as ping-pong buffers to reduce data swapping as much as possible.

E. Tasks on ARM Processor
Referring to Fig. 10, the entire CNN is implemented on the FPGA side. To fully utilize the computation resources available on the SoC, the remaining tasks are assigned to the ARM processor. Thus, the whole RoadNet-RT system is partitioned between the ARM processor and the FPGA as shown in Fig. 13. All three tasks are overlapped and pipelined, which increases the system throughput.

VI. RESULTS AND DISCUSSION
The implementation tools used in this article are Xilinx Vivado HLS and MATLAB HDL Coder Toolbox. The whole system has been implemented on the ZCU102 development kit with the PYNQ system installed (the system setup is shown in Fig. 14). There are 548,160 Flip-Flops (FFs), 274,080 Look-Up Tables (LUTs), 1824 (32.1 Mb) Block RAMs (BRAMs) and 2,520 DSPs on the board. The FPGA resource consumption of this accelerator for both the 16-bit and 8-bit quantization formats is shown in Tab. X.

Since each DSP48E2 slice can handle two $8\times8$-bit multiplications but only one 16-bit multiplication, the 8-bit accelerator consumes almost the same number of DSP slices and BRAMs as the 16-bit one while processing twice as many input images. To maximize the computation capability of the hardware, we quantize all the weights to 8 bits. When running at 250 MHz, this 8-bit accelerator processes 196.7 fps. In Tab. XII, all the image-based road segmentation solutions in the KITTI leaderboard are summarized and compared to our solution on GPU and FPGA. Most of the existing methods take 100 ms or longer. One of the only two real-time solutions, FCN-LC [48], runs on a TITAN X GPU, which requires a PC with a 600-650 W power supply. Therefore, our solution offers a well-balanced and practical way to run the road segmentation task on embedded devices.
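To illustrate how one wide multiplier can serve two 8-bit multiplications, the sketch below packs two signed 8-bit activations, for example the same pixel position in two different input images, into one operand and multiplies it by a shared 8-bit weight. This follows a commonly described packing scheme and is an assumption about the mechanism, not a description of the implemented accelerator; the 18-bit shift and the function name are illustrative choices.

```python
SHIFT = 18  # assumed field width; |b * w| <= 16384 fits in 18 signed bits


def packed_dual_multiply(a, b, w):
    """Return (a*w, b*w) for signed 8-bit a, b, w using one wide multiplication."""
    for v in (a, b, w):
        if not -128 <= v <= 127:
            raise ValueError("operands must fit in signed 8 bits")
    packed = (a << SHIFT) + b          # two activations share one wide operand
    product = packed * w               # single multiplication: a*w*2^18 + b*w
    low = product & ((1 << SHIFT) - 1)
    if low >= 1 << (SHIFT - 1):        # reinterpret the low field as signed
        low -= 1 << SHIFT
    high = (product - low) >> SHIFT    # remove the low field, then shift down
    return high, low
```

Because the two products share the weight operand, the scheme suits the case described above, in which the 8-bit accelerator applies the same weights to two input images at once.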

In this accelerator, 8 feature map buffers are allocated. This number may vary according to the balance between the available resources on the target FPGA and the required processing speed. More feature map buffers can store more intermediate feature maps on chip and consequently increase the processing speed, whereas fewer feature map buffers force more temporary data to be stored in external memory, which leads to a longer processing time.
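The trade-off between buffer count and on-chip memory can be estimated with simple arithmetic. The sketch below computes only the raw storage of the $35 \times 120 \times 32$ feature map buffers and assumes that 1 Mb equals $2^{20}$ bits. It ignores BRAM block granularity, port widths, the weight and pooling buffers, and any batching of two images in the 8-bit format, so the actual utilization reported in Tab. X will differ. The function names are hypothetical.

```python
MBIT = 2 ** 20            # assumption: 1 Mb = 2^20 bits
BRAM_TOTAL_MBIT = 32.1    # on-chip BRAM capacity quoted for the ZCU102


def feature_buffer_bits(height=35, width=120, depth=32, bits_per_value=8):
    """Raw storage of one feature map buffer, in bits."""
    return height * width * depth * bits_per_value


def total_buffer_mbit(num_buffers=8, bits_per_value=8):
    """Raw storage of all feature map buffers, in Mb."""
    return num_buffers * feature_buffer_bits(bits_per_value=bits_per_value) / MBIT


if __name__ == "__main__":
    for bits in (8, 16):
        used = total_buffer_mbit(num_buffers=8, bits_per_value=bits)
        print(f"{bits}-bit values: {used:.2f} Mb of {BRAM_TOTAL_MBIT} Mb "
              f"({100 * used / BRAM_TOTAL_MBIT:.0f}% raw)")
```

Under these assumptions, eight buffers stay within the BRAM budget for both formats, which is consistent with the observation that the buffer count is a tunable parameter rather than a hard limit.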

The FPGA performance on the KITTI validation dataset is shown in Tab. XI. After replacing all the large-kernel and dilated convolutions with convolutions of uniform kernel size and quantizing the weights to INT8 format, the IOU of the network on FPGA is 91.99%, which is 1.68% lower than that of the proposed floating-point RoadNet-RT.

VII. CONCLUSION
This article presents a real-time, high-throughput convolutional neural network architecture for road segmentation. Several optimization techniques are applied to reduce the number of operations while preserving accuracy. This network achieves a 92.55% MaxF score on the KITTI dataset at 111 fps on a GTX 1080 GPU (for image size 280 × 960). More importantly, using RoadNet-RT as an example, we present a systematic approach to CNN optimization for hardware implementation. Following this as a guideline, one can convert an existing CNN structure into a computation-efficient, high-throughput architecture for FPGA with little loss in accuracy. Several experiments have been conducted to support the proposed approach. Finally, an SoC design has been successfully demonstrated on the ZCU102 FPGA development kit, which speeds up processing by a factor of 1.72 compared to the GPU implementation.

REFERENCES


Lin Bai (Graduate Student Member, IEEE) received the B.S. degree in integrated circuits design and integrated system from the University of Electronic Science and Technology of China in 2009, and the M.S. degree in electrical engineering and information technology from the Swiss Federal Institute of Technology, Zürich, in 2012. He is currently pursuing the Ph.D. degree with the Worcester Polytechnic Institute, USA. He was an FPGA engineer in industry. His current research interest includes the hardware acceleration of deep learning algorithms on FPGA and ASIC.

Yecheng Lyu (Graduate Student Member, IEEE) received the B.S. degree from Wuhan University, China, in 2012, and the M.S. degree from the Worcester Polytechnic Institute, USA, in 2015, where he is currently pursuing the Ph.D. degree in electrical engineering. He is a Chair Professor. His current research interests include sensor fusion, autonomous vehicle perception, and deep learning.

Xinming Huang (Senior Member, IEEE) received the B.S. degree in electrical engineering from Virginia Tech in 2001. He was a Member of Technical Staff with Bell Labs, Lucent Technologies. Since 2006, he has been a Faculty with the Department of Electrical and Computer Engineering, Worcester Polytechnic Institute (WPI), where he is currently a Professor. His current research interests include circuits and systems, with emphasis on autonomous vehicles, deep learning, the IoT, and wireless communications.