NEON Programming
Start Date: Please contact us
Prerequisite:
Knowledge of the ARMv4T / ARMv5TE instruction set
Practical labs:
Labs are run under DS-5 or GCC/Trace32
Objectives:
- This course is designed for programmers who want to run multimedia algorithms on the NEON SIMD (Single Instruction, Multiple Data) execution units
- Each instruction family is detailed, first at assembly level, and then at C level using the macros (intrinsics) provided in the arm_neon.h file
- Several non-obvious uses of the processing instructions are presented
- Vector and vector element load / store instructions are studied, and guidelines for organizing data in memory are provided to minimize the number of memory accesses
- The underlying cache operation and the preload mechanisms (preload instructions and hardware prefetch) are detailed to explain how processing can be pipelined
- The course shows how typical DSP algorithms such as FIR and FFT can be vectorized and then optimized for execution on the NEON unit
Course Outline:
1. CORTEX-A8 AND CORTEX-A9(MP) ARCHITECTURE [2-hour]
• Data path, studying how data are loaded from external memory and copied into level 1 and possibly level 2 caches
• Programmer’s model
• Highlighting coherency issues when data are shared by several cores; purpose of the Snoop Control Unit (SCU) implemented in Cortex-A9
• Cortex-A8 and Cortex-A9 instruction pipeline, branch predictors
2. INTRODUCTION TO NEON/VFPv3 [2-hour]
• Clarifying the resources shared by NEON and VFP
• Register bank, Q registers, D registers
• Data types
• Vector vs scalar
• Related system registers
• Alignment issues
• Enabling NEON/VFP
3. NEON INSTRUCTION SYNTAX [2-hour]
• Instructions producing wider / narrower results
• Instruction modifiers
• Selecting the shape
• Selecting the operand / result type
• Syntax flexibility
• Declaring initialized vectors in C language
• Using unions with vectors and arrays of vectors to simplify debugging
• Casting vectors
4. LOAD / STORE INSTRUCTIONS [2-hour]
• Addressing modes
• Vector load / store
• Vector load / store multiple
• Element and structure load / store instructions
– Single element to one lane
– Single element to all lanes
• Optimizing the ordering of data in memory to take advantage of 2-, 3- and 4-element structures
• Example: managing audio samples
• Processor acceleration mechanisms: store merging buffers
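
As an illustration of the 2-element structure loads and the audio sample example listed above, here is a minimal sketch that splits interleaved 16-bit stereo samples into separate left and right buffers. The function name `deinterleave_stereo` and the buffer layout are assumptions made for this example, not material taken from the course. The NEON path is compiled only when the compiler defines `__ARM_NEON`; other targets use the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split interleaved stereo samples (L0 R0 L1 R1 ...)
   into two planar buffers. `frames` is the number of L/R pairs. */
void deinterleave_stereo(const int16_t *interleaved, int16_t *left,
                         int16_t *right, size_t frames)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* vld2q_s16 (VLD2.16) loads 8 frames and de-interleaves them:
       val[0] receives the even-indexed elements, val[1] the odd ones. */
    for (; i + 8 <= frames; i += 8) {
        int16x8x2_t lr = vld2q_s16(interleaved + 2 * i);
        vst1q_s16(left + i, lr.val[0]);
        vst1q_s16(right + i, lr.val[1]);
    }
#endif
    /* Scalar tail, and complete fallback on targets without NEON. */
    for (; i < frames; ++i) {
        left[i] = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
}
```

Keeping the samples interleaved in memory lets one structure load fill both channel vectors, instead of issuing separate loads followed by unzip operations.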
5. DATA TRANSFER INSTRUCTIONS [1-hour]
• Move
• Swap
• Table lookup
• Vector transpose
• Vector zip / unzip
• Data transfer between NEON and integer unit
• Logical AND, Bit Clear, OR, XOR
• Operations with immediate values
• Bitwise insert instructions, avoiding branches
• Count leading zeros, set bits (ones), leading sign bits
• Normalizing floating point numbers when VFP is not implemented
• Scalar duplicate
• Extract
• Shift with possible rounding and saturation
• Bitfield reverse
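
The "bitwise insert instructions, avoiding branches" item above can be sketched as follows: a comparison produces an all-ones or all-zeros mask per element, and a bitwise select then picks each element from one of two sources without any conditional branch. The function name `select_greater_u8` is hypothetical, and the scalar fallback is only a model of what the vector instructions do per lane.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = (a[i] > b[i]) ? a[i] : b[i],
   computed with a compare mask and a bitwise select instead of a branch. */
void select_greater_u8(const uint8_t *a, const uint8_t *b,
                       uint8_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        uint8x16_t mask = vcgtq_u8(va, vb);      /* 0xFF where a > b, else 0x00 */
        vst1q_u8(out + i, vbslq_u8(mask, va, vb)); /* mask bit set: take a, else b */
    }
#endif
    /* Scalar model of the same mask-and-select operation. */
    for (; i < n; ++i) {
        uint8_t mask = (uint8_t)(0 - (a[i] > b[i]));
        out[i] = (uint8_t)((a[i] & mask) | (b[i] & (uint8_t)~mask));
    }
}
```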
7. ARITHMETIC INSTRUCTIONS [2-hour]
• Add, modulo vs saturated arithmetic
• Halving / Doubling the result
• Rounding
• Subtract
• Multiply
• Multiply accumulate / Multiply subtract
• Absolute value
• Min / Max
• Converting floating-point numbers into fixed-point numbers
• Converting fixed-point numbers into floating-point numbers
• Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm
• Pairwise instructions
• Element comparison
– Practical lab: converting fixed-point elements into single precision floating point values and adding the resulting elements
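
A possible shape for the practical lab above, given only as a sketch: the Q16.16 fixed-point format, the function name `q16_to_float_add` and the array-based interface are assumptions chosen for illustration, not the actual lab statement. The NEON path uses the fixed-point to floating-point conversion with 16 fractional bits, then a vector add.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: convert two arrays of signed Q16.16 fixed-point
   values to single precision and store their element-wise sum. */
void q16_to_float_add(const int32_t *a, const int32_t *b,
                      float *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        /* vcvtq_n_f32_s32(v, 16) converts fixed point with 16 fraction bits. */
        float32x4_t fa = vcvtq_n_f32_s32(vld1q_s32(a + i), 16);
        float32x4_t fb = vcvtq_n_f32_s32(vld1q_s32(b + i), 16);
        vst1q_f32(out + i, vaddq_f32(fa, fb));
    }
#endif
    /* Scalar tail and fallback: dividing by 2^16 removes the fraction scaling. */
    for (; i < n; ++i) {
        out[i] = (float)a[i] / 65536.0f + (float)b[i] / 65536.0f;
    }
}
```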
8. NEON CODING EXAMPLES [2-hour]
• FIR filter
– Finding the NEON instructions to encode the vector algorithm
– Optimizing the code
– Using the performance monitor to tune the algorithm
• FFT (DFT)
– Optimizing the code
– Using the performance monitor to tune the algorithm
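
To give an idea of what vectorizing the FIR filter listed above can look like, here is a minimal sketch that computes four output samples per iteration with a multiply-accumulate by scalar. It is one straightforward approach under stated assumptions, not the optimized version developed in the course: the function name `fir_f32` is hypothetical, the filter is written as y[n] = sum of h[k] * x[n + k], and the input buffer is assumed to hold n_out + taps - 1 samples.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: y[n] = sum over k of h[k] * x[n + k].
   x must contain at least n_out + taps - 1 samples. */
void fir_f32(const float *x, const float *h, float *y,
             size_t n_out, size_t taps)
{
    size_t n = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four consecutive outputs share each coefficient h[k], so each tap
       becomes one vector load plus one multiply-accumulate by scalar. */
    for (; n + 4 <= n_out; n += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (size_t k = 0; k < taps; ++k) {
            acc = vmlaq_n_f32(acc, vld1q_f32(x + n + k), h[k]);
        }
        vst1q_f32(y + n, acc);
    }
#endif
    /* Scalar tail and fallback on targets without NEON. */
    for (; n < n_out; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; ++k) {
            acc += h[k] * x[n + k];
        }
        y[n] = acc;
    }
}
```

In this form every tap reloads an overlapping input vector; reducing those loads and unrolling the tap loop are typical next steps when tuning with the performance monitor.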