Implementation of H.264 video encoder based on ADSP-BF561

H.264/AVC is the latest international video coding standard, developed jointly by the ITU-T VCEG and ISO/IEC MPEG, and is one of the most active research topics in the field of image communication. Its video coding layer (VCL) adopts many new techniques and thus greatly improves coding performance, but at the cost of a steep increase in computational complexity, which poses a serious challenge for real-time video encoding and transmission. To meet the real-time requirements of image compression, the existing H.264 codec must therefore be optimized. This article discusses the hardware platform and task flow of an H.264 system and introduces code-level algorithm optimizations based on the characteristics of a DSP hardware platform. The ADSP-BF561 (Blackfin561) is a high-performance digital signal processor from Analog Devices with a core clock of up to 600 MHz; this article therefore chooses it as the hardware platform to explore an effective way to implement an H.264 encoder on a DSP with limited resources.
1 Hardware platform
1.1 ADSP-BF561 processor
Blackfin561 is a high-performance fixed-point DSP for video processing in the Blackfin series. Its core clock runs at up to 600 MHz, and its core contains 2 16-bit multiply-accumulate units (MACs), 2 40-bit ALUs, 4 8-bit video ALUs, and 1 40-bit barrel shifter. Two data address generators (DAGs) can supply addresses for fetching dual operands from memory simultaneously, allowing the chip to perform up to 1200 million multiply-accumulate operations per second. The chip has dedicated video signal processing instructions, 100 KB of on-chip L1 memory per core (16 KB instruction cache, 16 KB instruction SRAM, 64 KB data cache/SRAM, and 4 KB scratchpad data SRAM), 128 KB of on-chip L2 SRAM, and dynamic power management. In addition, the Blackfin processor includes a rich set of peripherals: an EBIU (supporting four 128 MB SDRAM banks and four 1 MB asynchronous memory banks), 3 timers/counters, 1 UART, 1 SPI interface, 2 synchronous serial ports, and 1 parallel peripheral interface (supporting the ITU-R BT.656 data format). The Blackfin processor thus fully embodies algorithmic support for media applications, especially video.
1.2 Video encoder platform based on ADSP-BF561
The hardware structure of the Blackfin561 video encoder is shown in Figure 1. The platform uses the Analog Devices ADSP-BF561 EZ-KIT Lite evaluation board, which includes an ADSP-BF561 processor, 32 MB of SDRAM, and 4 MB of Flash. The on-board AD1836 audio codec provides a 4-input/6-output audio interface, while the ADV7183 video decoder and ADV7171 video encoder provide a 3-input/3-output video interface. The board also includes a UART interface, a USB debugging interface, and a JTAG debugging interface. In Figure 1, the analog video signal from the camera is converted into a digital signal by the ADV7183 video decoder; this signal enters the Blackfin561 through PPI1 (parallel peripheral interface 1), is compressed on-chip, and the result is output through PPI2. The system can load programs from Flash and supports serial-port and network transmission. Data such as original images and reference frames used during encoding are stored in SDRAM.

2 Main features of H.264 video compression coding algorithm

Video codec standards fall into two main series: the MPEG series, formulated by ISO/IEC (the International Organization for Standardization / International Electrotechnical Commission), and the H.26x series, formulated by ITU-T (the International Telecommunication Union). The ITU-T standards include H.261, H.262, H.263, and H.264, and are mainly used for real-time video communication such as video conferencing.
The H.264 video compression algorithm uses a block-based hybrid coding method similar to H.263 and MPEG-4, with two coding modes: intra-frame coding (Intra) and inter-frame coding (Inter). Compared with previous coding standards, H.264 adopts the following new coding technologies to improve coding efficiency, compression ratio, and image quality:
(1) H.264 divides the video coding system by function into two layers: the video coding layer (VCL, Video Coding Layer) and the network abstraction layer (NAL, Network Abstraction Layer). The VCL performs efficient compression of the video sequence, while the NAL standardizes the format of the video data, mainly providing header information to suit transmission and storage over various media.
(2) Advanced intra prediction, which uses 4×4 prediction for macroblocks with rich spatial detail and the 16×16 prediction mode for flat areas; the former offers 9 prediction modes and the latter 4.
(3) Inter prediction with more block partition types. The standard defines 7 block sizes and shapes: macroblock partitions (16×16, 16×8, 8×16) and sub-macroblock partitions (8×8, 8×4, 4×8, 4×4). Using smaller blocks and adaptive coding reduces the prediction residual, further lowering the bit rate.
(4) High-precision motion prediction based on 1/4 pixel accuracy.
(5) Multi-reference frame prediction. In inter-frame coding, up to 5 different reference frames can be selected.

(6) Integer transform (DCT/IDCT). The 4×4 integer transform of the residual image uses fixed-point arithmetic in place of the floating-point arithmetic of the conventional DCT, which reduces coding time and also makes it better suited to porting to a hardware platform.
(7) H.264/AVC supports two entropy coding methods: CAVLC (Context-based Adaptive Variable Length Coding) and CABAC (Context-based Adaptive Binary Arithmetic Coding). CAVLC has better error resilience but lower coding efficiency than CABAC; CABAC has high coding efficiency but requires more computation and storage.
(8) A new in-loop deblocking filter technique.
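As an illustration of point (6), the core butterfly of the H.264 forward 4×4 integer transform can be written entirely in fixed-point C. This is a sketch of the standard transform only; the per-coefficient scaling that H.264 folds into quantization is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Forward 4x4 integer transform of H.264: a two-pass butterfly using
   only additions, subtractions and shifts (2*x is written as x << 1,
   so no multiplications are needed). */
void h264_transform4x4(int16_t in[4][4], int16_t out[4][4])
{
    int16_t tmp[4][4];
    int i, j;

    /* Horizontal pass: transform each row. */
    for (i = 0; i < 4; i++) {
        int s0 = in[i][0] + in[i][3];
        int s1 = in[i][1] + in[i][2];
        int s2 = in[i][1] - in[i][2];
        int s3 = in[i][0] - in[i][3];
        tmp[i][0] = (int16_t)(s0 + s1);
        tmp[i][2] = (int16_t)(s0 - s1);
        tmp[i][1] = (int16_t)((s3 << 1) + s2);
        tmp[i][3] = (int16_t)(s3 - (s2 << 1));
    }
    /* Vertical pass: transform each column. */
    for (j = 0; j < 4; j++) {
        int s0 = tmp[0][j] + tmp[3][j];
        int s1 = tmp[1][j] + tmp[2][j];
        int s2 = tmp[1][j] - tmp[2][j];
        int s3 = tmp[0][j] - tmp[3][j];
        out[0][j] = (int16_t)(s0 + s1);
        out[2][j] = (int16_t)(s0 - s1);
        out[1][j] = (int16_t)((s3 << 1) + s2);
        out[3][j] = (int16_t)(s3 - (s2 << 1));
    }
}
```

A flat block (all samples equal) transforms to a single DC coefficient, which makes a convenient sanity check.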

These new technologies take moving-image compression a big step forward. H.264 offers better compression performance than MPEG-4 and H.263 and can be applied in high-performance video compression fields such as the Internet, digital video, DVD, and TV broadcasting.
3 Implementation of H.264 video encoding algorithm
Implementing H.264 on the DSP requires three steps: optimizing the C algorithm on the PC, porting the program from the PC to the DSP, and optimizing the code on the DSP platform.
3.1 C algorithm optimization on PC
According to the system requirements, this design selects the baseline profile of the ITU JM 8.5 reference software as the standard algorithm. The JM reference software is designed for the PC, where it achieves good coding results. When porting video codec software to a DSP, the DSP's system resources must be considered, chiefly system space (both program space and data space). The original C code therefore needs to be evaluated, which requires some understanding of the code being ported. Figure 2 shows the algorithm structure of H.264.
After understanding the algorithm structure, the next step is to determine which parts of the encoding algorithm are computation-intensive and time-consuming. VC6's built-in profile analysis tool shows that the intra-frame and inter-frame coding parts occupy more than 60% of the overall running time, with ME (Motion Estimation) taking the largest share. The focus of porting and optimization should therefore be on the motion estimation part, and the code structure should be adjusted accordingly.
(1) Delete unnecessary files and functions

Since the baseline profile and a single reference frame are selected, many files and functions can be deleted. This includes the code for unsupported features such as B-frames, SI slices, SP slices, data partitioning, hierarchical coding, the weighted prediction mode, and the CABAC coding mode; files such as rtp.c (by making the encoder output a NAL byte stream instead of the RTP format), sei.c (which handles supplemental coding information that is not coded into the bitstream), leaky_bucket.c (used to calculate leak-bucket parameters), and intrarefresh.c (the random intra macroblock refresh mode), together with their related header files and the corresponding global variables and functions declared in global.h. Field-related global and local variables such as top_pic and bottom_pic can also be removed, along with the code for multi-slice segmentation, FMO, field coding / frame-field-adaptive coding / macroblock-adaptive coding, the associated prediction and reference frame reordering, and the related input/output and decoder buffering operations.
(2) Rewrite of the configuration function

Since JM's system parameters are configured by reading the encoder.cfg file, the parameter configuration can be changed from file parsing to a centralized initialization function that assigns values directly. This reduces the amount of code, the occupation of limited memory space, and file-reading time, increasing the overall encoding speed of the encoder. For example, the int variable input->img_height can be assigned directly as input->img_height = 288 (CIF format).
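A minimal sketch of this rewrite (the structure and field names below are illustrative stand-ins for JM's InputParameters; only a few members are shown):

```c
/* Simplified stand-in for JM's InputParameters structure; the field
   names follow the variables mentioned above, other members omitted. */
typedef struct {
    int img_width;
    int img_height;
    int FramesToBeEncoded;
    int qp;
} InputParameters;

/* Replaces the encoder.cfg file parser: all parameters are assigned
   directly at startup, so no file I/O or text-parsing code is linked
   into the DSP image. */
void init_input_parameters(InputParameters *input)
{
    input->img_width  = 352;         /* CIF width  */
    input->img_height = 288;         /* CIF height */
    input->FramesToBeEncoded = 100;  /* illustrative value */
    input->qp = 28;                  /* illustrative value */
}
```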
(3) Remove redundant printing information

To facilitate debugging and algorithm improvement, JM produces a large amount of printed output. To improve encoding speed and reduce storage consumption, this output can be deleted, such as the extensive trace information and the encoding statistics files. Files such as log.dat and stat.dat are only needed for debugging on the PC and need not be ported to the DSP platform, so the code related to them can be removed entirely. However, the basic information needed during debugging (such as bit rate, signal-to-noise ratio, and coding sequence) should be kept for reference.
The adjustment can make the structure and capacity of the code more streamlined, thus preparing for the subsequent transplantation on the DSP.
3.2 Program migration from PC to DSP
To port the streamlined PC program to the ADSP-BF561 development environment, VisualDSP++, so that it can initially run, the main considerations are syntax rules and memory allocation.
(1) Remove all functions not supported by the compilation environment

The main work is to remove time-related functions, change file operations into reads from data buffers, delete SNR statistics collection, and remove other code not required on the DSP platform. Note also that function declarations and data structure types must conform to the DSP's C language conventions.
(2) Add code related to hardware
This code includes system initialization, output module code, interrupt service routines, and the rate control program.
(3) Configure the LDF file

The freshly ported code and data are usually far too large to fit in SRAM, which causes link errors. At the beginning it is best to place all code and data in SDRAM so that the project links without problems; the stack and heap are likewise placed in SDRAM at first. In the early stage what is needed is a program that runs correctly; speed comes second.
(4) Solve the problem of Malloc

Under DSP development, malloc is a problem that must be solved. Dynamically allocated memory may appear to work, but the results are often wrong. It is therefore best to allocate statically, for example in the form of arrays.
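A sketch of the static-allocation approach (the type and buffer names here are hypothetical, chosen for illustration):

```c
#include <stddef.h>

/* Worst-case sizing for a CIF frame: (352/16) * (288/16) = 396 macroblocks. */
#define MAX_MB_NUM ((352 / 16) * (288 / 16))

typedef struct { short x, y; } MotionVector;   /* illustrative type */

/* Instead of: MotionVector *mv = malloc(MAX_MB_NUM * sizeof *mv);
   the worst-case buffer is reserved statically, so the linker assigns
   it a fixed address in an LDF-defined section and allocation can
   never fail at run time. */
static MotionVector mv_buf[MAX_MB_NUM];

MotionVector *get_mv_buffer(size_t *count)
{
    *count = MAX_MB_NUM;
    return mv_buf;
}
```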
After the port is complete, H.264 encoding on the ADSP-BF561 processor is functional. If the speed does not yet meet real-time encoding requirements, the code can be optimized further.
4 Code optimization on DSP platform
The main methods for optimizing code in the VisualDSP++ development environment are C-level optimization and assembly-level optimization.
4.1 C-level optimization

The profile analysis in VC6 showed that the focus of porting and optimization should be the motion estimation part. After comparing various algorithms, the author chose the diamond search (DS) method. The DS algorithm uses two search templates: a large template with 9 search points, the LDSP (Large Diamond Search Pattern), and a small template with 5 search points, the SDSP (Small Diamond Search Pattern). The diamond search is illustrated in Figure 3. The search begins with the large template. When the minimum block-error (SAD) point falls at the center of the LDSP, the LDSP is replaced with the SDSP for a final matching step, and the best SDSP point is taken as the optimal match, ending the search; otherwise, the LDSP search continues with the new minimum point as its center.


Experiments with JM confirmed that this method saves about 10% of the running time with little increase in code size.
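The LDSP/SDSP procedure described above can be sketched as follows. This is a simplified illustration: the matching cost is passed in as a callback, search-range clamping and real SAD computation are omitted, and toy_sad is a toy cost surface used only for demonstration.

```c
/* Cost callback: returns the block-matching error (e.g. SAD) at (x, y). */
typedef int (*cost_fn)(int x, int y, void *ctx);

/* Diamond search: iterate the 9-point LDSP until the best point stays
   at the center, then refine once with the 5-point SDSP. */
void diamond_search(cost_fn cost, void *ctx, int *best_x, int *best_y)
{
    static const int ldsp[8][2] = { {2,0},{-2,0},{0,2},{0,-2},
                                    {1,1},{1,-1},{-1,1},{-1,-1} };
    static const int sdsp[4][2] = { {1,0},{-1,0},{0,1},{0,-1} };
    int cx = *best_x, cy = *best_y;
    int best = cost(cx, cy, ctx);
    int i, fx, fy;

    for (;;) {                               /* LDSP stage */
        int nx = cx, ny = cy;
        for (i = 0; i < 8; i++) {
            int c = cost(cx + ldsp[i][0], cy + ldsp[i][1], ctx);
            if (c < best) {
                best = c;
                nx = cx + ldsp[i][0];
                ny = cy + ldsp[i][1];
            }
        }
        if (nx == cx && ny == cy)            /* minimum stays at center */
            break;
        cx = nx; cy = ny;                    /* re-center and repeat */
    }

    fx = cx; fy = cy;                        /* SDSP refinement */
    for (i = 0; i < 4; i++) {
        int c = cost(cx + sdsp[i][0], cy + sdsp[i][1], ctx);
        if (c < best) {
            best = c;
            fx = cx + sdsp[i][0];
            fy = cy + sdsp[i][1];
        }
    }
    *best_x = fx; *best_y = fy;
}

/* Toy cost surface for demonstration: minimum at (3, 2). */
int toy_sad(int x, int y, void *ctx)
{
    int dx = x - 3, dy = y - 2;
    (void)ctx;
    if (dx < 0) dx = -dx;
    if (dy < 0) dy = -dy;
    return dx + dy;
}
```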
According to the characteristics of DSP and related hardware instructions, the code can be optimized as follows during design:
◇ Adjust the program structure. Rewrite statements ill-suited to DSP execution to improve the parallelism of the code.
◇ Use macros. Short, single-purpose, frequently called functions are converted into macros.
◇ Loop optimization: unroll the C for loops to keep the pipeline full and improve parallelism.

◇ Tabularize calculations: turn values computed at run time into constant lookup tables, converting run-time computation into a simple lookup. For example, the bit-shift amounts used in the quantization and inverse quantization routines can all be precomputed, so that later operations obtain them by table lookup.
◇ Convert floating point to fixed point. The Blackfin561 does not support floating-point operations, but the original program code uses floating-point arithmetic; it must be changed to fixed point, after which it executes much faster.
◇ Replace multiplication and division with logical operations wherever possible. Multiply and especially divide instructions take much longer to execute than logical shift instructions, so shifts should replace them wherever possible to speed up execution.
◇ Minimize function calls. Small functions are best turned into inline functions or written directly into the caller; functions that are called infrequently can also be merged into the main function, eliminating unnecessary call overhead and increasing speed.

◇ Reduce conditional branches.
◇ Allocate memory statically wherever possible.
◇ Use the rich set of intrinsic functions provided by the system.
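Two of the rewrites above, shown as small illustrative helpers (not JM code): a rounded shift replacing a division, and a precomputed table for the per-QP quantization shift qbits = 15 + QP/6 used in H.264.

```c
#include <stdint.h>

/* (a) Shift replacing a division: rounded averaging of two pixels. */
static inline int avg2(int a, int b)
{
    return (a + b + 1) >> 1;          /* replaces (a + b + 1) / 2 */
}

/* (b) Calculation tabularization: the per-QP quantization shift in
   H.264 is 15 + QP/6; precompute it for all 52 QP values so the inner
   loop does a table lookup instead of a division. */
static const uint8_t qbits_tab[52] = {
    15,15,15,15,15,15, 16,16,16,16,16,16, 17,17,17,17,17,17,
    18,18,18,18,18,18, 19,19,19,19,19,19, 20,20,20,20,20,20,
    21,21,21,21,21,21, 22,22,22,22,22,22, 23,23,23,23
};

static inline int qbits(int qp)
{
    return qbits_tab[qp];             /* replaces 15 + qp / 6 */
}
```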

In addition, to give full play to the DSP's computing power, one must also start from its hardware structure, make the most of its functional units, and use software pipelining so that programs execute in parallel without conflicts. The most time-consuming functions can also be extracted and rewritten in linear assembly to maximize the use of the DSP's parallelism.
4.2 Assembly level optimization
Assembly-level optimization mainly refers to the following operations:
(1) Use register resources

Blackfin561 provides 8 32-bit data registers and a set of address registers. If local variables are used to hold intermediate results, replacing them with registers saves a great deal of memory-access time.
(2) Use special instructions

Blackfin561 provides max, min, absolute value, clipping, and a large number of video-specific instructions, and wider accesses should be used to fetch several narrower data items at once. Using these instructions can greatly improve execution speed. For example, two short (16-bit) values can be fetched with a single int (32-bit) access by placing them in the upper and lower 16-bit halves of a 32-bit register; this doubles data-read efficiency and reduces the number of memory accesses.
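A C-level sketch of this packing idea (on the DSP, the compiler or hand-written assembly maps such 32-bit accesses onto single loads and stores):

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word: a single 32-bit
   load/store then moves both, halving the number of memory accesses. */
static inline uint32_t pack16x2(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

static inline uint16_t lo16(uint32_t w) { return (uint16_t)(w & 0xFFFFu); }
static inline uint16_t hi16(uint32_t w) { return (uint16_t)(w >> 16); }
```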
(3) Use parallel instructions and vector instructions

Each general instruction in the ADSP-BF561 can be issued in parallel with one or two memory-access instructions, which keeps the ADSP-BF561's pipeline fully loaded and makes fuller use of its data-processing capability.
(4) Reasonably store the repeatedly called program segment

Place repeatedly called program segments (such as the DCT and IDCT transforms) in the on-chip program memory, place frequently used data segments (such as coding tables) in the on-chip data memory, and place rarely used programs and data segments in off-chip memory, so as to avoid unnecessary repeated movement of programs and data.
(5) Reasonable use of internal and external memory

The BF561 has only 256 KB of on-chip storage, so the current frame, the reference frame, and the reconstructed frame must be placed in off-chip memory. If the compressed bitstream is read out by a host, it too can be placed off-chip. Other data, such as program code, global variables, VLC code tables, and the intermediate data produced by each encoding module, can be placed on-chip.
(6) Use of DMA

Since CPU access to off-chip memory is usually tens of times slower than access to on-chip memory, off-chip data transfers usually become the bottleneck at run time: even highly efficient code will seriously stall the pipeline while waiting for data. The effective solution is to transfer data by DMA. The program encodes macroblock by macroblock: while the current macroblock is being encoded, DMA moves the data of the next macroblock and the reference frame data it needs from off-chip to on-chip; after the current macroblock has been motion compensated, DMA moves the reconstructed macroblock from on-chip back off-chip. In this way the CPU operates only on on-chip data and the pipeline proceeds smoothly; the compressed bitstream is produced codeword by codeword at intervals, so the CPU can write it directly off-chip.
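The double-buffering (ping-pong) scheme described above can be sketched as follows. In this sketch memcpy stands in for a Blackfin memory-DMA transfer (which would run in parallel with the CPU), and encode_mb is a trivial stand-in for the real macroblock encoding path.

```c
#include <string.h>

#define MB_BYTES 256                        /* one 16x16 luma macroblock */

static unsigned char l1_buf[2][MB_BYTES];   /* on-chip ping-pong buffers */

/* Placeholder for a memory-DMA transfer; on the BF561 this would start
   an MDMA channel and the CPU would keep working while it runs. */
static void dma_fetch(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, MB_BYTES);
}

/* Trivial stand-in for the real macroblock encoding path. */
static int encode_mb(const unsigned char *mb)
{
    int i, sum = 0;
    for (i = 0; i < MB_BYTES; i++)
        sum += mb[i];
    return sum;
}

/* Ping-pong loop: while macroblock mb is encoded out of one buffer,
   the next macroblock is fetched into the other. */
int encode_frame(const unsigned char *frame, int num_mb)
{
    int total = 0, cur = 0, mb;

    dma_fetch(l1_buf[cur], frame);          /* prime the first buffer */
    for (mb = 0; mb < num_mb; mb++) {
        if (mb + 1 < num_mb)                /* start the next transfer */
            dma_fetch(l1_buf[cur ^ 1], frame + (mb + 1) * MB_BYTES);
        total += encode_mb(l1_buf[cur]);    /* encode the current MB */
        cur ^= 1;                           /* swap ping and pong */
    }
    return total;
}
```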
5 Conclusion

After the optimized program, with the corresponding functions rewritten in ADSP-BF561 assembly language, was debugged and run, the efficiency of the DCT and IDCT parts increased by approximately 15 times, and that of the deblocking filter part by approximately 6 to 7 times. Good optimization results were also obtained for the other functions in each module, showing that the optimization work was indeed effective.
