Prerequisite:
• Knowledge of Cortex-A9.
• More than 12 correct answers to Cortex-A prerequisites questionnaire
Skills Gained: After completing this training, you will be able to:
• This course is split into 3 important parts:
• Cortex-A15 / Cortex-A7 architecture
• Cortex-A15 / Cortex-A7 software implementation and debug
• Cortex-A15 / Cortex-A7 hardware implementation
• The new Hypervisor privilege mode is introduced at the beginning of this course
• The consequences for address translation are then explained, introducing the 2-stage translation
• Decoupling guest OS from hardware using traps to Hypervisor is studied
• The course also details the new features of the Generic Interrupt Controller v2, explaining how physical interrupt requests can be virtualized
• The course details the new approach regarding integrated timers / counters
• AXI v4 new capabilities are highlighted with regard to AXI v3
• Through sequences involving a Cortex-A15MP and a Cortex-A7MP, the hardware coherency is studied, explaining how snoop requests can be forwarded by CCI-400 interconnect
• Implementation of I/O MMU-400 is also covered
Documentation
Training manuals will be given to attendees during training. Precise and easy to use, those notes can be used as a reference afterwards.
Related tutorials
• Programming with RVDS IDE (reference RV0)
• VFP programming (reference RC0)
• NEON programming (reference RC1)
Practical labs:
• Labs are run under DS-5
Course Outline:
First day
OVERVIEW OF CORTEX-A15MP AND CORTEX-A7 [1-hour]
• Cortex-A15 & Cortex-A7 architecture
• AMBA4 coherent interconnect capabilities
• Inner Shareable vs Outer Shareable attribute
• I/O MMU
• 64-Byte cacheline size, integrated L2 cache
• VFPv4 and SIMDv2
• Supported instruction sets, new integer divide instructions
• Highlighting differences between Cortex-A7 and Cortex-A15
• Configurable options
INSTRUCTION PIPELINE [1-hour]
• Global organization, triple issue capability for Cortex-A15
• Single issue capability for Cortex-A7
• Fetch / decode / rename / dispatch stages
• Loop mode
• Branch accelerators: Branch Target Buffer, Global History Buffer, Return Stack, Indirect predictor
• 2-level dynamic predictor
• Branch prediction maintenance operations
INTRODUCTION TO HYPERVISOR STATE [1-hour]
• Processor privilege levels state machine, user, guest OS, hypervisor
• Detailing the various operation modes (Bare-Metal, Hypervisor kernel and user task, Hypervisor with Guest partition)
• Asymmetric approach, no support for Virtualization of Secure state functionality
• SVC, HVC and SMC instructions
• Objective of the Hypervisor: virtualizing the hardware platform on which the guest partition is executed
• Hypervisor related instructions and registers
• Support for interrupt nesting in Hypervisor mode
• LR banking disabling when running in hypervisor mode
• List of registers that have to be saved / restored to be able to suspend / resume a guest partition
• Accessing banked registers of any Non-Secure mode while running in Hypervisor mode
• Detecting VFP/Neon utilization by a Guest partition
EXCEPTION MECHANISM [2-hour]
• Hypervisor vector table
• Utilization of Vector #5 to trap Guest partition events
• System Call into Hypervisor mode
• Asynchronous exceptions
• Virtual Interrupt and Abort bits control, IRQ, FIQ, external abort routing control
• Hypervisor exception return
• Use of Hypervisor mode in Secure State
• Taking exceptions into Hypervisor mode
• Simplifying the design of the Guest partition trap handler by using the HSR register
GENERIC INTERRUPT CONTROLLER (GICv2) [3-hour]
• Integration in a SoC based on Cortex-A15MP and Cortex-A7MP
• Highlighting the new features with regard to Cortex-A9MP, especially hypervisor related interrupts
• Steering interrupts to guest OS or Hypervisor
• Virtual CPU interface
• Split EOI functionality
• Deactivating an interrupt source from the Virtual CPU interface
• Front-end interface accessed by the Guest Kernel
• Back-end interface accessed by the Hypervisor
• Writeable Active Bits
Second day
AMBA4 [3-hour]
• AXI-4
– Burst greater than 16 beats
– Quality of Service signaling
– Multiple region interfaces
– Restrictions for Non-modifiable and Modifiable transactions
– Updated meaning of Read Allocate and Write Allocate
– Memory type requirements
– Transaction buffering
• AXI-4 stream protocol
– Byte types, data, position, null
– Byte stream
– Sparse stream
– Data merging, packing, and width conversion
– Packet boundaries
– User sideband signaling
• AXI-4 lite
– Burst length of 1
– All accesses are equivalent to AWCACHE or ARCACHE equal to b0000
– No exclusive access support
• AXI Coherency Extension (ACE)
– Shareability domains
– Coherency model, cache states
– Additional channel signals
– New channels, snoop address, snoop response, snoop data
– Studying through sequences how a load request and a store request will be handled whenever they are marked as outer shareable requests
– Distinguishing partial cacheline writes from entire cacheline writes
– Using ReadUnique, CleanUnique and MakeUnique requests
– Distributed Virtual Memory (DVM)
– DVM synchronization message
– Selecting the coherency state machine: MESI or MOESI according to the capabilities of the interconnect
– Snoop filtering
– Broadcast interface
• Exported barriers
– DMB / DSB inner shareable, outer shareable or system
– Explaining through use cases the purpose of these 3 kinds of barriers
HARDWARE IMPLEMENTATION OF CORTEX-A15 AND CORTEX-A7 [1-hour]
• Clock domains, CLK, PCLKDBG, ACLKM, ACLKS, ATCLK, PERIPHCLK
• Clock gating, related CP15 registers
• Resets, power-on reset timing diagram
• Valid reset combinations
• Power domains, cluster power gating
• Power-on reset sequence, soft reset sequence
• Automatic L1 caches invalidation
• L2 cache invalidation, using L2RESETDISABLE
• Power management, WFI / WFE, dormant mode based on L2 memory
• Maintaining coherency while CPUs are in standby state
• Interface to the Power Management Unit
• Powering down a CPU
• External debug over power down
• Neon and VFP clock gating
CCI-400 CACHE COHERENT INTERCONNECT [2-hour]
• AMBA 4 snoop request transport
• Snoop connectivity and control
• ACE master interface
• ACP slave interface
• Connecting 2 CPUs through CCI, managing coherency domains
big.LITTLE OPERATION [2-hour]
• big.LITTLE benefits with regard to heterogeneous operation, no directory in CCI-400
• Migrating from one cluster to another cluster according to task load
• List of items to migrate in order to suspend / resume system execution
• Hypervisor controlled migration
• Minimizing the impact on performance
• Avoiding spurious wakeups in order to reduce consumption
• Re-routing interrupts
Third day
VIRTUALIZATION EXTENSIONS [2-hour]
• New Intermediate Physical Address, 2-stage address translation
• Relationship between the IPA generated by the Guest OS and the true Physical address
• Memory translation system
• Memory management when running in hypervisor mode
• Virtual Machine Identification
• New Instructions defined by the Virtualization Extensions
• Exposing the MMU to Other Masters, IO MMU
• Emulation support, trapping load and store and executing them in Hypervisor state
• Additional security facilities
• Second-stage access permissions and attributes
LARGE PHYSICAL ADDRESS EXTENSIONS SPECIFICATION (LPAE) [2-hour]
• Need to introduce support for a second stage of translation as part of the Virtualization Extensions
• New 3-level system
• Hypervisor-level address translation
• Level-1 table descriptor format
• Level-2 table descriptor format
• Attribute and Permission fields in the translation tables
• Improving the caching of translation entries by providing contiguous hints
• Complete set of cache allocation hints
• Handling of the ASID in the LPAE
• CP15 registers definition
• New cache and TLB maintenance operations
MMU IMPLEMENTATION [1-hour]
• TLB organization, L1-TLB, L2-TLB
• TLB match process
• Coherent table walk
• Understanding how copies of descriptors present in memory are stored in L1TLB, L2TLB and possibly data caches
• Determining the exact cause of aborts through status registers
• Behavior when MMU is disabled
• TLB maintenance operations
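As an illustration of the 2-stage address translation covered on this day, the sketch below chains two lookups: stage 1 maps a virtual address to an Intermediate Physical Address (IPA), and stage 2 maps that IPA to a physical address. It is a toy model only: the single-level `toy_table` type, the 4 KB page size, the 32-bit addresses and all function names are assumptions made for this example and do not reflect the LPAE descriptor formats.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                       /* assumed 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  8u                        /* hypothetical tiny address space */

/* Hypothetical single-level table: input page number -> output page number. */
typedef struct {
    bool     valid[NUM_PAGES];
    uint32_t out_page[NUM_PAGES];
} toy_table;

/* One translation stage. Returns false to model a translation fault. */
static bool toy_lookup(const toy_table *t, uint32_t in, uint32_t *out)
{
    uint32_t page = in >> PAGE_SHIFT;
    if (page >= NUM_PAGES || !t->valid[page])
        return false;
    *out = (t->out_page[page] << PAGE_SHIFT) | (in & PAGE_MASK);
    return true;
}

/* Stage 1 (guest OS tables): VA -> IPA. Stage 2 (hypervisor tables): IPA -> PA. */
static bool translate_two_stage(const toy_table *stage1, const toy_table *stage2,
                                uint32_t va, uint32_t *pa)
{
    uint32_t ipa;
    if (!toy_lookup(stage1, va, &ipa))
        return false;                        /* stage 1 fault */
    return toy_lookup(stage2, ipa, pa);      /* false here models a stage 2 fault */
}
```

In this model a stage 1 fault would be handled by the guest OS, while a stage 2 fault would be handled by the hypervisor.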
OS SUPPORT – SYNCHRONIZATION OVERVIEW [2-hour]
• Inter-Processor Interrupts
• Barriers
• Cluster ID
• Exclusive access monitor, implementing Boolean semaphores
• Global monitor
• Spin-lock implementation
• Using events
• Indicating the effect of Multi Core on debug interfaces
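The spin-lock topic above can be sketched in portable C11. This is a minimal illustration, not the course's lab code: the `demo_spinlock` type and function names are hypothetical. On ARMv7 targets, compilers typically lower these atomic operations to exclusive load/store sequences, which is where the exclusive access monitors come into play.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration only. */
typedef struct {
    atomic_flag locked;
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once. test_and_set returns the previous value, so false means acquired. */
static inline bool demo_spin_trylock(demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock is acquired. */
static inline void demo_spin_lock(demo_spinlock *l)
{
    while (!demo_spin_trylock(l)) {
        /* A real ARM implementation could wait for an event here
           instead of spinning at full speed. */
    }
}

/* Release with release ordering so prior writes are visible to the next owner. */
static inline void demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The acquire and release orderings play the role of the barriers listed in this section: accesses inside the critical section cannot be observed outside of it.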
CORESIGHT DEBUG [1-hour]
• Program Trace Macrocell
• Cross Trigger Interface and Cross Trigger Matrix for multi-processor debugging
• Adding Virtual Machine ID in the criterion used to set a breakpoint / watchpoint
• Tracking VMID change in trace output
Fourth day
LEVEL ONE SUBSYSTEM – CORTEX-A15 AND CORTEX-A7 [2-hour]
• Physically Indexed Physically Tagged caches
• Cache organization
• LRU replacement algorithm, implementation with a 2-way cache
• Speculative accesses
• Hit Under Miss, Miss under Miss
• Optional parity protection in L1 caches
• Write streaming threshold definition
• Uploading the contents of L1 caches through dedicated CP15 registers
• MESI data cacheline states
• Detailing cache maintenance operations
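For the LRU replacement topic above, a 2-way cache needs only one bit of state per set: the index of the way that was used least recently. The sketch below is a generic software model under assumed parameters (`NUM_SETS`, the `cache2w` type and `cache_access` are hypothetical names); it does not describe the actual Cortex-A15 or Cortex-A7 hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 4u  /* hypothetical number of sets */

typedef struct {
    uint32_t tag[2];
    bool     valid[2];
    uint8_t  lru_way;  /* way to evict next (least recently used) */
} cache_set;

typedef struct {
    cache_set sets[NUM_SETS];
} cache2w;

/* Access one cache line. Returns true on a hit; on a miss the line is allocated. */
static bool cache_access(cache2w *c, uint32_t line_addr)
{
    cache_set *s = &c->sets[line_addr % NUM_SETS];
    uint32_t tag = line_addr / NUM_SETS;

    for (unsigned w = 0; w < 2; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru_way = (uint8_t)(1u - w);  /* the other way becomes LRU */
            return true;
        }
    }

    /* Miss: fill an invalid way first, otherwise evict the LRU way. */
    unsigned victim = s->lru_way;
    if (!s->valid[0])
        victim = 0;
    else if (!s->valid[1])
        victim = 1;

    s->tag[victim] = tag;
    s->valid[victim] = true;
    s->lru_way = (uint8_t)(1u - victim);
    return false;
}
```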
LEVEL TWO SUBSYSTEM – CORTEX-A15 AND CORTEX-A7 [3-hour]
• Cache organization, L2 cache bank structure
• Random replacement algorithm
• Strictly enforced inclusion property with L1 data caches, simplification of snooping
• Impact of register slices on performance
• Uploading the contents of L2 cache through dedicated CP15 registers
• L2 prefetch engine, clarifying the utilization of the 16-entry prefetch request queue, L2PFR CP15 register
• Table walk access prefetch
• Selecting the number of lines that can be prefetched
• ACE master interface
• ACP slave interface
• By means of sequences involving a multi-core Cortex-A7/A15 and external masters, understanding how snoop requests can be used to maintain coherency of data between caches and memory
• Implementing register slices to compensate for different route delays
• Synchronization primitives, the 3 levels of monitors
GENERIC TIMER [1-hour]
• ARM generic 64-bit timers for each processor
• Virtual time vs Physical time
• Effect of virtualization on these timers
• Event stream purpose
• Kernel event stream generation
• Hypervisor event stream generation
• Gray count timer distribution scheme
• Memory-mapped counter module
• New CP15 registers
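The Gray count distribution scheme listed above relies on the property that consecutive Gray-coded values differ in exactly one bit, so a counter value sampled in another clock domain is never off by more than one count. The conversion helpers below are a minimal sketch of the encoding itself; the function names are hypothetical and the 64-bit width is chosen to match the 64-bit timers mentioned in this section.

```c
#include <stdint.h>

/* Binary to Gray: each output bit is the XOR of two adjacent input bits. */
static uint64_t bin_to_gray(uint64_t b)
{
    return b ^ (b >> 1);
}

/* Gray to binary: prefix XOR computed with doubling shifts (1, 2, 4, ..., 32). */
static uint64_t gray_to_bin(uint64_t g)
{
    for (unsigned shift = 1; shift < 64; shift <<= 1)
        g ^= g >> shift;
    return g;
}
```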
PERFORMANCE MONITORING VIRTUALIZATION EXTENSIONS [1-hour]
• Hypervisor performance monitoring
• Guest OS performance monitoring
• Lazy switching of PMU state by a hypervisor
• Reducing the number of counters available to a Guest OS
• Fully virtualizing the PMU identity registers
• Event filtering
• New CP15 registers