**Price** 3245 + VAT

**DURATION** 2 Days

### Prerequisite:

Knowledge of the ARMv4T / ARMv5TE instruction sets

**Practical labs:**

Labs are run under DS-5 or GCC/Trace32

### Skills Gained: After completing this training, you will be able to:

- This course has been designed for programmers wanting to run multimedia algorithms on NEON Single Instruction Multiple Data (SIMD) execution units
- Each instruction family is detailed, first at assembly level, and then at C level using the macros provided in the arm_neon.h file
- Several tricky uses of the processing instructions are presented
- Vector and vector element load / store instructions are studied, and guidelines for organizing data in memory are provided to minimize the number of memory accesses
- The underlying cache operation as well as the preload mechanisms (preload instructions and hardware prefetch) are detailed to explain how processing can be pipelined
- The course shows how typical DSP algorithms such as FIR and FFT can be vectorized and then optimized to be executed on the NEON unit

### Course Outline:

**1. CORTEX-A8 AND CORTEX-A9(MP) ARCHITECTURE [2-hour]**

• Data path, studying how data are loaded from external memory and copied into level 1 and possibly level 2 caches

• Programmer’s model

• Highlighting coherency issues when data are shared by several cores, purpose of the SCU implemented in Cortex-A9

• Cortex-A8 and Cortex-A9 instruction pipeline, branch predictors

**2. INTRODUCTION TO NEON/VFPv3 [2-hour]**

• Clarifying the resources shared by NEON and VFP

• Register bank, Q registers, D registers

• Data types

• Vector vs scalar

• Related system registers

• Alignment issues

• Enabling NEON/VFP

**3. NEON INSTRUCTION SYNTAX [2-hour]**

• Instructions producing wider / narrower results

• Instruction modifiers

• Selecting the shape

• Selecting the operand / result type

• Syntax flexibility

• Declaring initialized vectors in C language

• Using unions with vectors and arrays of vectors to simplify the debug

• Casting vectors

**4. LOAD / STORE INSTRUCTIONS [2-hour]**

• Addressing modes

• Vector load / store

• Vector load / store multiple

• Element and structure load / store instructions

– Single element to 1 lane

– Single element to all lanes

• Optimizing the ordering of data in memory to take advantage of 2-, 3- and 4-element structures

• Example: managing audio samples

• Processor acceleration mechanisms: store merging buffers
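
As an illustration of the structure load instructions and the audio sample example listed above, here is a minimal sketch in C. It assumes interleaved 16-bit stereo samples (L0 R0 L1 R1 ...); the function name `deinterleave_stereo_s16` is hypothetical and not taken from the course material. When NEON is available, `vld2q_s16` loads 16 samples and separates them into two vectors in one instruction; otherwise a scalar loop does the same work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split interleaved stereo samples into two planar
 * buffers. `frames` is the number of left/right pairs. */
void deinterleave_stereo_s16(const int16_t *interleaved,
                             int16_t *left, int16_t *right, size_t frames)
{
    size_t i = 0;

    assert(interleaved != NULL && left != NULL && right != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 2-element structure load: val[0] receives the even-indexed samples
     * (left channel), val[1] the odd-indexed samples (right channel). */
    for (; i + 8 <= frames; i += 8) {
        int16x8x2_t lr = vld2q_s16(interleaved + 2 * i);
        vst1q_s16(left + i, lr.val[0]);
        vst1q_s16(right + i, lr.val[1]);
    }
#endif

    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < frames; ++i) {
        left[i] = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
}
```

Keeping the data interleaved in memory and letting the load instruction separate the channels avoids a separate shuffle pass over the samples.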

**5. DATA TRANSFER INSTRUCTIONS [1-hour]**

• Move

• Swap

• Table lookup

• Vector transpose

• Vector zip / unzip

• Data transfer between NEON and integer unit

**6. LOGICAL AND BITFIELD INSTRUCTIONS [2-hour]**

• Logical AND, Bit Clear, OR, XOR

• Operations with immediate values

• Bitwise insert instructions, avoiding branches

• Count Leading zeros, ones, signs

• Normalizing floating point numbers when VFP is not implemented

• Scalar duplicate

• Extract

• Shift with possible rounding and saturation

• Bitfield reverse
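
The idea of using bitwise insert/select instructions to avoid branches can be sketched as follows. This is an illustrative example only: the function name `select_greater_s32` is hypothetical, and the choice of a per-element maximum is an assumption made for the demonstration. The NEON path builds an all-ones / all-zeros mask with a compare and feeds it to `vbslq_s32`; the scalar fallback applies the same mask technique one element at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = (a[i] > b[i]) ? a[i] : b[i], with no
 * data-dependent branch in the selection itself. */
void select_greater_s32(const int32_t *a, const int32_t *b,
                        int32_t *out, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && out != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        /* Each lane of the mask is all ones where a > b, else all zeros. */
        uint32x4_t mask = vcgtq_s32(va, vb);
        /* Bitwise select: take bits from va where mask is 1, else from vb. */
        vst1q_s32(out + i, vbslq_s32(mask, va, vb));
    }
#endif

    /* Scalar version of the same mask-and-select technique. */
    for (; i < n; ++i) {
        uint32_t mask = (uint32_t)0 - (uint32_t)(a[i] > b[i]);
        uint32_t bits = ((uint32_t)a[i] & mask) | ((uint32_t)b[i] & ~mask);
        out[i] = (int32_t)bits;
    }
}
```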

**7. ARITHMETIC INSTRUCTIONS [2-hour]**

• Add, modulo vs saturated arithmetic

• Halving / Doubling the result

• Rounding

• Subtract

• Multiply

• Multiply accumulate / Multiply subtract

• Absolute value

• Min / Max

• Converting Floating Point numbers into Fixed point numbers

• Converting Fixed point numbers into Floating point numbers

• Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm

• Pairwise instructions

• Element comparison

– Practical lab: converting fixed-point elements into single precision floating point values and adding the resulting elements
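
A possible shape for the practical lab above is sketched here; it is not the course's reference solution. It assumes signed Q16.16 fixed-point input (16 fractional bits, a choice made for this example) and the hypothetical function name `sum_q16_as_float`. The NEON path converts four elements at a time with `vcvtq_n_f32_s32`, accumulates them in a vector, then reduces the four lanes with a pairwise add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Assumed fixed-point format for this sketch: Q16.16. */
#define FRAC_BITS 16

/* Hypothetical helper: convert each Q16.16 element to single precision
 * and return the sum of the converted elements. */
float sum_q16_as_float(const int32_t *fixed, size_t n)
{
    float total = 0.0f;
    size_t i = 0;

    assert(fixed != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        int32x4_t q = vld1q_s32(fixed + i);
        /* Fixed-point to floating-point conversion with 16 fraction bits. */
        float32x4_t f = vcvtq_n_f32_s32(q, FRAC_BITS);
        acc = vaddq_f32(acc, f);
    }
    /* Reduce four lanes to one: add halves, then a pairwise add. */
    float32x2_t pair = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    pair = vpadd_f32(pair, pair);
    total = vget_lane_f32(pair, 0);
#endif

    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i) {
        total += (float)fixed[i] / (float)(1 << FRAC_BITS);
    }
    return total;
}
```

Note that the vector path adds elements in a different order than a plain scalar loop, so results can differ in the last bits for inputs that are not exactly representable.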

**8. NEON CODING EXAMPLES [2-hour]**

• FIR filter

– Finding the NEON instructions to encode the vector algorithm

– Optimizing the code

– Using the performance monitor to tune the algorithm

• FFT (DFT)

– Optimizing the code

– Using the performance monitor to tune the algorithm
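
As a starting point for the FIR example, before any of the optimization or performance-monitor tuning steps listed above, a direct-form filter can be vectorized across four output samples. This is a minimal sketch under stated assumptions: the function name `fir_f32` is hypothetical, the input buffer is assumed to hold `n_out + taps - 1` samples, and each output is defined as y[n] = Σ h[k]·x[n + taps - 1 - k].

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: direct-form FIR.
 * x must contain n_out + taps - 1 samples; y receives n_out samples. */
void fir_f32(const float *x, const float *h, float *y,
             size_t n_out, size_t taps)
{
    size_t n = 0;

    assert(x != NULL && h != NULL && y != NULL && taps > 0);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four consecutive outputs share each coefficient, so one vector
     * multiply-accumulate by scalar advances all four at once. */
    for (; n + 4 <= n_out; n += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (size_t k = 0; k < taps; ++k) {
            float32x4_t xv = vld1q_f32(x + n + taps - 1 - k);
            acc = vmlaq_n_f32(acc, xv, h[k]);
        }
        vst1q_f32(y + n, acc);
    }
#endif

    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; n < n_out; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; ++k) {
            acc += h[k] * x[n + taps - 1 - k];
        }
        y[n] = acc;
    }
}
```

The inner loop issues one unaligned vector load per tap; reducing those loads is the kind of tuning the optimization step would address.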