Why Is Apple’s M1 Chip So Fast?

original link

date Feb 28, 2021

authors Erik Engheim

reading time 5 mins

category blog

Table of Contents

Definitions
Differences
Optimization

Definitions

CPU = Registers + ALUs

A CPU at its most basic level is a device with a number of named memory cells called registers and a number of computational units called arithmetic logic units (ALU).

Registers

The concept of registers is old. For example, on this old mechanical calculator, the register is what holds the numbers you are adding. Likely the origin of the term cash register. The register is where you registered input numbers

Thread

If you don’t know what a thread is, then you can think of it as the process of carrying out a task. With two cores, a CPU can carry out two separate tasks concurrently: two threads.

Processors: multi-core or out-of-order

Multi-core or Out-of-Order processors? There are two approaches to this. Add more CPU cores. Each core works independent and in parallel. Make each CPU core execute multiple instructions in parallel.

Differences

What can modern RISC architectures not do?

Modern RISC CPUs cannot do operations on numbers that are not in a register like this… For example, it cannot add two numbers residing in RAM in two different locations. Instead, it has to pull these two numbers into a separate register.

CPU is just one of the chips

The M1 is not a CPU, it is a whole system of multiple chips put into one large silicon package. The CPU is just one of these chips… M1 is a system on a chip. Meaning all the parts making up a computer are placed on one silicon chip

Unified Memory Architecture

Apple’s Unified Memory Architecture tries to solve all these problems without having the disadvantages of old school shared memory. They achieve this in the following ways:

There is no special area reserved just for the CPU or just the GPU. Memory is allocated to both processors.
No copying is needed.

Powerul GPU that is not overheating

Apple has gotten the watt usage of the GPU down, so that a relatively powerful GPU can be integrated without overheating the SoC.

Unified Memory

In the Nvidea world Unified Memory simply means that there is software and hardware which takes care of automatically copying data back and forth between the separate CPU and GPU memory. Thus from a programmers perspective Apple and Nvidia Unified Memory may look the same, but it is not the same in a physical sense.

Selling CPUs vs IPs

Here we get a big problem with the Intel and AMD business model. Their business models are based on selling general-purpose CPUs, which people just slot onto a large PC motherboard… But we are quickly moving away from that world. In the new SoC world, you don’t assemble physical components from different vendors. Instead, you assemble IP (intellectual property) from different vendors.

Conflict of SoC manufacturer vs PC manufacturers

Sure Intel and AMD may simply begin to sell whole finished SoCs. But what are these to contain? PC-makers may have different ideas of what they should contain. You potentially get a conflict between Intel, AMD, Microsoft, and PC-makers about what sort of specialized chips should be included.

x86 vs ARM ISA

AMD and Intel processors understand the x86 ISA, while Apple Silicon chips, such as M1, understand the ARM Instruction-Set Architecture (ISA).

Optimization

Specialized chips vs general-purpose chips

Instead of adding ever more general-purpose CPU cores, Apple has followed another strategy: They have started adding ever more specialized chips doing a few specialized tasks. The benefit of this is that specialized chips tend to be able to perform their tasks significantly faster using much less electric current than a general-purpose CPU core.

Types of specialized chips:

Central processing unit (CPU)
Graphics processing unit (GPU)
Image processing unit (ISP) —
Digital signal processor (DSP)
Neural processing unit (NPU)
Video encoder/decoder
Secure Enclave — encryption, authentication, and security.
Unified memory

Speed improvements for image and video editing

This is part of the reason why a lot of people working on images and video editing with the M1 Macs are seeing such speed improvements. A lot of the tasks they do can run directly on specialized hardware.

CPUs: small fast portions. GPUs: huge portions.

CPUs and GPUs don’t want their memory served the same way. Let us do a silly food analogy: CPUs want their plate of data served very quickly by the waiter, but they are totally cool with small portion sizes… This is how your GPU wants their memory: huge portions. The more the merrier.

Vertical integration

For Apple this is simple. They control the whole widget. They give you, for example, the Core ML library for developers to write machine learning stuff. Whether Core ML runs on Apple’s CPU or the Neural Engine is an implementation detail developers don’t have to care about.

Faster strategies

In principle you accomplish in a combination of two strategies: Perform more instructions in a sequence faster. Perform lots of instructions in parallel.

Increasing clock speed is no longer possible

However, today increasing the clock frequency is next to impossible. That is the whole “End of Moore’s Law” that people have been harping on for over a decade now.

Out-of-Order execution

It is the superior Out-of-Order execution that is making the Firestorm cores on the M1 kick ass and take names. It is in fact much stronger than anything from Intel or AMD and they may never be able to catch up.