Price 3245 + VAT
DURATION 2 Days

Prerequisite:

Knowledge of the ARMv4T / ARMv5TE instruction set

Practical labs:

Labs are run under DS-5 or GCC/Trace32

Course Objectives:

  • This course is designed for programmers who want to run multimedia algorithms on the NEON Single Instruction, Multiple Data (SIMD) execution units
  • Each instruction family is detailed, first at assembly level, and then at C level using the macros (intrinsics) provided in the arm_neon.h file
  • Several non-obvious uses of the data-processing instructions are presented
  • Vector and vector element load / store instructions are studied, and guidelines for organizing data in memory are provided to minimize the number of memory accesses
  • The underlying cache operation as well as the preload mechanisms (preload instructions and hardware prefetch) are detailed to explain how processing can be pipelined
  • The course shows how typical DSP algorithms such as FIR and FFT can be vectorized and then optimized for execution on the NEON unit

Course Outline:

1. CORTEX-A8 AND CORTEX-A9(MP) ARCHITECTURE [2-hour]
• Data path, studying how data are loaded from external memory and copied into level 1 and possibly level 2 caches
• Programmer’s model
• Coherency issues when data are shared by several cores; purpose of the Snoop Control Unit (SCU) implemented in Cortex-A9
• Cortex-A8 and Cortex-A9 instruction pipeline, branch predictors

2. INTRODUCTION TO NEON/VFPv3 [2-hour]
• Clarifying the resources shared by NEON and VFP
• Register bank, Q registers, D registers
• Data types
• Vector vs scalar
• Related system registers
• Alignment issues
• Enabling NEON/VFP

3. NEON INSTRUCTION SYNTAX [2-hour]
• Instructions producing wider / narrower results
• Instruction modifiers
• Selecting the shape
• Selecting the operand / result type
• Syntax flexibility
• Declaring initialized vectors in C language
• Using unions with vectors and arrays of vectors to simplify debugging
• Casting vectors

4. LOAD / STORE INSTRUCTIONS [2-hour]
• Addressing modes
• Vector load / store
• Vector load / store multiple
• Element and structure load / store instructions

– Multiple single elements
– Single element to one lane
– Single element to all lanes

• Optimizing the ordering of data in memory to take advantage of 2-, 3- and 4-element structures
• Example: managing audio samples
• Processor acceleration mechanisms: store merging buffers

– Practical lab: using load instructions with de-interleaving to place all right-channel samples in one vector and all left-channel samples in another vector
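
As an illustration of what the de-interleaving lab aims at, the following is a minimal scalar sketch in portable C, not NEON code and not the course's lab solution. The function name `deinterleave_stereo` and the assumption that the left sample comes first in each frame are hypothetical. On NEON, a 2-element structure load (for example the `vld2_s16` intrinsic) performs the same split for four frames in a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 2-element structure load with de-interleaving.
 * Assumption (not from the course material): the input holds stereo
 * frames ordered L0 R0 L1 R1 ...
 * All left samples go to one array and all right samples to another,
 * which is what a de-interleaving load does with two registers. */
void deinterleave_stereo(const int16_t *interleaved, size_t frames,
                         int16_t *left, int16_t *right)
{
    for (size_t i = 0; i < frames; ++i) {
        left[i]  = interleaved[2 * i];      /* even positions */
        right[i] = interleaved[2 * i + 1];  /* odd positions  */
    }
}
```

Once the channels sit in separate vectors, each channel can be processed with ordinary element-wise instructions and no shuffling.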

5. DATA TRANSFER INSTRUCTIONS [1-hour]
• Move
• Swap
• Table lookup
• Vector transpose
• Vector zip / unzip
• Data transfer between NEON and integer unit

– Practical lab: clarifying narrow and long instructions, building a vector from bytes selected from a pair of vectors
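
To make the idea of building a vector from bytes selected from a pair of vectors concrete, here is a hedged scalar model in portable C. The function name `table_lookup2` is hypothetical; the behaviour modelled is that of a two-register table lookup (the `vtbl2_u8` intrinsic on NEON), where an index outside the table yields zero.

```c
#include <stdint.h>

/* Scalar model of a table lookup over a pair of 8-byte vectors.
 * The two inputs form one 16-entry table: indices 0..7 select from
 * 'lo', indices 8..15 select from 'hi'. Any other index produces 0,
 * matching the out-of-range behaviour of the VTBL instruction. */
void table_lookup2(const uint8_t lo[8], const uint8_t hi[8],
                   const uint8_t idx[8], uint8_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        uint8_t k = idx[i];
        if (k < 8) {
            out[i] = lo[k];
        } else if (k < 16) {
            out[i] = hi[k - 8];
        } else {
            out[i] = 0;
        }
    }
}
```

With an index vector such as {0, 8, 1, 9, ...} the result interleaves bytes from the two source vectors, so one lookup can replace several move instructions.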

6. LOGICAL AND BITFIELD INSTRUCTIONS [2-hour]
• Logical AND, Bit Clear, OR, XOR
• Operations with immediate values
• Bitwise insert instructions, avoiding branches
• Count leading zeros, count leading sign bits, count set bits
• Normalizing floating point numbers when VFP is not implemented
• Scalar duplicate
• Extract
• Shift with possible rounding and saturation
• Bitfield reverse

– Practical lab: Transposing a matrix, shifting a large bitmap using vector instructions
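
The "avoiding branches" item above can be sketched as follows. This is a scalar model in portable C under stated assumptions, not NEON code: the name `lane_max_u8` is hypothetical, and the loop stands in for the eight lanes that a 64-bit vector would process at once. A comparison produces an all-ones or all-zeros mask per lane, and a bitwise select then picks each result from one operand or the other, which corresponds to pairing a compare such as `vcgt_u8` with a bitwise select such as `vbsl_u8`.

```c
#include <stdint.h>

#define LANES 8

/* Per-lane maximum computed with a mask instead of an if/else.
 * mask is 0xFF where a[i] > b[i] and 0x00 otherwise; the select
 * takes bits from a where the mask is set and from b elsewhere. */
void lane_max_u8(const uint8_t a[LANES], const uint8_t b[LANES],
                 uint8_t out[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        uint8_t mask = (uint8_t)-(a[i] > b[i]);
        out[i] = (uint8_t)((a[i] & mask) | (b[i] & (uint8_t)~mask));
    }
}
```

Because every lane executes the same instruction sequence regardless of the comparison outcome, the per-element decision costs no branch misprediction.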

7. ARITHMETIC INSTRUCTIONS [2-hour]
• Add, modulo vs saturated arithmetic
• Halving / Doubling the result
• Rounding
• Subtract
• Multiply
• Multiply accumulate / Multiply subtract
• Absolute value
• Min / Max
• Converting floating-point numbers into fixed-point numbers
• Converting fixed-point numbers into floating-point numbers
• Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm
• Pairwise instructions
• Element comparison

– Practical lab: implementing a complex multiply accumulate with NEON
– Practical lab: converting fixed-point elements into single-precision floating-point values and adding the resulting elements
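
The difference between modulo and saturated arithmetic listed at the start of this chapter can be shown on a single 16-bit lane. The sketch below is plain C with hypothetical function names; on NEON the two behaviours correspond to the non-saturating add (for example `vadd_s16`) and the saturating add (for example `vqadd_s16`) applied to every lane.

```c
#include <stdint.h>

/* Modulo (wrapping) add: the result is reduced modulo 2^16.
 * The final conversion to int16_t is implementation-defined in C,
 * but it wraps on all common two's-complement compilers. */
int16_t add_modulo_s16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Saturating add: the result is clamped to the int16_t range
 * instead of wrapping around. */
int16_t add_saturate_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

For audio samples, wrapping turns a slightly too loud peak into a full-scale value of the opposite sign, whereas saturation merely clips it, which is why saturated arithmetic is usually preferred for signal data.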

8. NEON CODING EXAMPLES [2-hour]
• FIR filter

– Converting the scalar algorithm into a vector algorithm
– Finding the NEON instructions to encode the vector algorithm
– Optimizing the code
– Using the performance monitor to tune the algorithm
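
The first step above, converting the scalar FIR into a vector algorithm, might look like the following sketch. It is portable C that models four lanes with a small array rather than using NEON intrinsics, and it is an assumption-laden illustration, not the course's implementation: the function names are hypothetical, the filter is written in correlation form (coefficients not reversed), and the output count is assumed to be a multiple of four. In real NEON code the inner lane loop would typically become one vector load plus one multiply-accumulate by scalar, such as `vmlaq_n_f32`.

```c
#include <stddef.h>

#define LANES 4

/* Scalar reference: y[n] = sum over k of h[k] * x[n + k].
 * x must hold n_out + taps - 1 samples. */
void fir_scalar(const float *x, const float *h, size_t taps,
                float *y, size_t n_out)
{
    for (size_t n = 0; n < n_out; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; ++k)
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}

/* Vectorized form: four consecutive outputs are computed together.
 * For each tap, one coefficient is broadcast and multiplied by four
 * consecutive input samples, then accumulated lane by lane.
 * Assumption: n_out is a multiple of LANES. */
void fir_by_lanes(const float *x, const float *h, size_t taps,
                  float *y, size_t n_out)
{
    for (size_t n = 0; n < n_out; n += LANES) {
        float acc[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < taps; ++k)
            for (size_t lane = 0; lane < LANES; ++lane)
                acc[lane] += h[k] * x[n + k + lane];
        for (size_t lane = 0; lane < LANES; ++lane)
            y[n + lane] = acc[lane];
    }
}
```

Vectorizing across outputs rather than across taps keeps each lane's accumulation order identical to the scalar version, so both functions can be compared sample by sample while tuning.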

• FFT (DFT)

– Converting the scalar algorithm into a vector algorithm, understanding how the symmetry properties of the unit circle can be used to process 4 angles concurrently
– Finding the NEON instructions to encode the vector algorithm
– Optimizing the code
– Using the performance monitor to tune the algorithm