


 Modern high performance CPUs (with ref to Apple/ARM) 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I read a couple of good deep descriptions of the machinery in Apple's new M1 chip, an ARM (or AArch64) implementation which seems to have great performance and insanely great performance per unit power. (I exaggerate.)

Apple's Humongous CPU Microarchitecture

A few quick takeaways:
- fast micros these days (since Alpha!) don't have single-cycle L1 cache - M1 has 3 cycles for L1D, which is best in class
- M1's Firestorm A14 cores have a really wide decode stage, helped by fixed length instructions
- there are just huge amounts of in-flight state

Apple's A14:
Attachment: Apple-A14-Firestorm-AnandTech.png


Another recent high-performance implementation:
Hot Chips 2020 Live Blog: IBM's POWER10 Processor on Samsung 7nm
Attachment: IBM-POWER10-AnandTech.png (File comment: IBM-POWER10)




Tue Nov 24, 2020 3:34 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
IBM has a new standard for BCD math (2000?) for its new 360-style mainframe computers (I have no idea what they are called today), for things like COBOL, taxes, or banking. I don't see that feature listed for modern CPUs. Is it patented just for IBM, hidden as some special upgrade feature, or are they so busy building an all-singing, all-dancing CPU that they have never thought about that feature?


Tue Nov 24, 2020 8:42 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
IBM has a new standard for BCD math (2000?) for its new 360-style mainframe computers (I have no idea what they are called today), for things like COBOL, taxes, or banking. I don't see that feature listed for modern CPUs. Is it patented just for IBM, hidden as some special upgrade feature,
I wonder if it is considered part of the floating-point unit (decimal floating-point).

Yikes! I think a lot of cycles would be wasted on a branch miss in such a humongous architecture.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 24, 2020 9:01 pm
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
According to this overview
https://threadreaderapp.com/thread/1331 ... 03104.html
Apple have added some mechanisms to help their case: a mode with strong memory ordering to help with x86 emulation; something which speeds up reference counting which helps Swift programs; something which specifically helps with JavaScript.


Thu Dec 03, 2020 6:18 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The mysterious “something”. I would think custom instructions in the AArch64 set could have compatibility issues with future ARMs.

I wonder what the percentage improvement in processing speed is compared to a smaller four-wide machine, and how the power consumption compares as well. For most apps it’s lucky if two instructions execute at the same time. I would think an eight-wide machine would sit idle a lot of the time. With all the functional units, the bypass network must be pretty large. Fixed-length instructions probably really help the design; otherwise a lot of pipelining would be needed in the decode.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 04, 2020 5:43 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
> For most apps it’s lucky if two instructions execute at the same time
It does feel like that, from a coding perspective. But seeing how many machines go to such expense to execute more, it can't be so! Perhaps we need a nice graphical simulation of an out of order machine to see which instructions get dispatched, which get stalled, and which get retired.

(It is possible that boring everyday code doesn't have much instruction level parallelism to exploit, but highly optimised crucial routines do, and that's where performance really counts.)
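Short of a graphical simulation, a text trace already shows the effect. Below is a minimal sketch of a toy scheduler, not a model of any real core: it assumes perfect register renaming (each value written once), an unlimited instruction window, oldest-first issue, and no dependency cycles. The `Instr` class, the register names, and the 3-cycle load latency are my own illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str          # value produced (assumed to be written only once)
    srcs: tuple        # values consumed
    latency: int = 1   # cycles until the result can be used


def simulate(program, width):
    """Return a per-cycle trace: trace[c] lists the instructions issued in cycle c."""
    produced = {ins.dest for ins in program}   # values made inside the program
    ready_at = {}                              # value -> cycle at which it is usable
    pending = list(program)                    # kept in program (oldest-first) order
    trace = []
    cycle = 0
    while pending:
        issued = []
        for ins in pending:
            if len(issued) == width:
                break
            # A source is ready if it is a live-in, or its producer has finished.
            if all(src not in produced or ready_at.get(src, float("inf")) <= cycle
                   for src in ins.srcs):
                issued.append(ins)
        for ins in issued:
            ready_at[ins.dest] = cycle + ins.latency
            pending.remove(ins)
        trace.append([ins.name for ins in issued])   # an empty list is a stall cycle
        cycle += 1
    return trace


program = [
    Instr("load a", "a", ("x",), latency=3),
    Instr("add b", "b", ("a",)),
    Instr("load c", "c", ("y",), latency=3),
    Instr("add d", "d", ("c",)),
    Instr("add e", "e", ("b", "d")),
]

for width in (1, 2, 4):
    trace = simulate(program, width)
    print(f"width {width}: {len(trace)} cycles")
    for cycle, names in enumerate(trace):
        print(f"  cycle {cycle}: {', '.join(names) or '(stall)'}")
```

For this tiny program, going from width 1 to width 2 saves one cycle and width 4 saves nothing more, because the load latency and the dependency chain dominate.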


Fri Dec 04, 2020 10:34 am
Profile

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
It also looks plausible to me that Apple is working heavily on their own fork of the LLVM compiler to take full advantage of their new A14 processors. From the developers' point of view, it should only take a recompile of their apps to get the benefits.


Sat Dec 05, 2020 12:29 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
It's hard to say how much faster it will be. Best-case timing is not average timing, and the memory cache affects the timing of the whole system. You need a balanced system so that everything gets a fair share of memory.

Faster is relative to the user. A better mouse-click response affects more people than a 5% increase in floating-point division speed.


Sun Dec 06, 2020 12:04 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Good point that compiler improvements might bring further gains. (Similarly compiler related, I gather it's an advantage to the M1's emulation of x86 that modern x86 code tends to use a fairly regular subset of x86.)

I found a really nice presentation on the limits to ILP: taking an ideal CPU, seeing how much instruction level parallelism might ideally be extracted from an instruction stream, and then gradually refining the machine (and compiler) to more realistic conditions:
http://www.cse.uaa.alaska.edu/~afkjm/cs ... ations.pdf
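The idea of starting from an ideal machine and then constraining it can be sketched in a few lines. This is my own toy model, not taken from the presentation: unit latency, perfect renaming, unlimited issue width, and a crude window rule (instruction i cannot issue until instruction i - window has issued). The function names and the two-chain example program are hypothetical.

```python
def cycles_needed(program, window=None):
    """program: list of (dest, srcs) in program order. Returns total cycles."""
    issue = []   # issue cycle of each instruction, in program order
    ready = {}   # value name -> cycle at which it becomes usable
    for i, (dest, srcs) in enumerate(program):
        # Earliest cycle allowed by true data dependencies (live-ins are ready at 0).
        t = max((ready.get(s, 0) for s in srcs), default=0)
        # Crude finite-window refinement: wait for instruction i - window to issue.
        if window is not None and i >= window:
            t = max(t, issue[i - window] + 1)
        issue.append(t)
        ready[dest] = t + 1   # unit latency
    return max(issue) + 1 if issue else 0


def ilp(program, window=None):
    """Average instructions per cycle under the model above."""
    cycles = cycles_needed(program, window)
    return len(program) / cycles if cycles else 0.0


def chain(name, n):
    """A dependency chain of n instructions: name1 <- name0, name2 <- name1, ..."""
    return [(f"{name}{i}", (f"{name}{i - 1}",)) for i in range(1, n + 1)]


# Two independent chains, emitted one after the other (not interleaved).
program = chain("a", 4) + chain("b", 4)

for window in (None, 4, 2):
    print(f"window={window}: {cycles_needed(program, window)} cycles, "
          f"ILP={ilp(program, window):.2f}")
```

With an unlimited window the two chains overlap completely and the ILP is 2. Shrinking the window hides the second chain from the machine until the first is mostly done, and the ILP falls toward 1.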

Probably via this discussion, one of many about M1: https://news.ycombinator.com/item?id=25257932
where we see
Quote:
...for a long time people were saying that "CISC is just compression for RISC, making virtue of necessity", but it seems like M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts...
and
Quote:
I can't comment on the economics of it but I can comment on the technical difficulties. The issue for x86 cores is keeping the ROB fed with instructions - no point in building a huge OoO if you can't keep it fed with instructions.
Keeping the ROB full falls on the engineering of the front-end, and here is where CISC v RISC plays a role. The variable length of x86 has implications beyond decode. The BTB design becomes simpler with a RISC ISA since a branch can only lie in certain chunks in a fetched instruction cache line in a RISC design (not so in CISC). RISC also makes other aspects of BPU design simpler - but I digress. Bottom line, Intel and AMD might not have a large ROB due to inherent differences in the front-end which prevent larger size ROBs from being fed with instructions.
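The front-end point about variable length can be illustrated with a minimal sketch. With fixed 4-byte instructions every start offset in a fetched chunk is known in advance, whereas with variable lengths each start depends on the length of the instruction before it. The length encoding here (low two bits of the first byte, plus one) is invented purely for illustration and is not x86.

```python
def fixed_starts(chunk_len, size=4):
    """Fixed-length ISA: every start offset is known without looking at any byte."""
    return list(range(0, chunk_len, size))


def variable_starts(chunk, length_of):
    """Variable-length ISA: a serial walk, since each start depends on the previous length."""
    starts = []
    pos = 0
    while pos < len(chunk):
        starts.append(pos)
        pos += length_of(chunk[pos])
    return starts


def toy_length(first_byte):
    """Hypothetical encoding: instruction length is 1 to 4 bytes, from the low two bits."""
    return (first_byte & 0x03) + 1


chunk = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC, 0x03, 0x00, 0x00, 0x00])

print(fixed_starts(16))                    # [0, 4, 8, 12]
print(variable_starts(chunk, toy_length))  # [0, 3, 4, 6]
```

The serial loop in `variable_starts` is the part that hardware has to work around in a wide variable-length decoder; in the fixed-length case there is nothing to work around.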


See also
Why do ARM chips have an instruction with Javascript in the name (FJCVTZS)?
which is probably the JS assist: one instruction replaces several, for an operation commonly needed by JS engines.
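To make "one instruction replaces several" concrete, here is a sketch of the double-to-int32 conversion that JavaScript specifies (ToInt32 in ECMAScript). I am assuming FJCVTZS implements these semantics, which is my understanding of it; the function name `js_to_int32` is mine.

```python
import math


def js_to_int32(x: float) -> int:
    """ECMAScript-style ToInt32: truncate toward zero, wrap modulo 2**32, reinterpret as signed."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x)       # round toward zero
    n &= 0xFFFFFFFF         # wrap modulo 2**32 (works for negative Python ints too)
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n


print(js_to_int32(3.7))            # 3
print(js_to_int32(-1.9))           # -1
print(js_to_int32(2.0 ** 31))      # -2147483648 (wraps rather than saturating)
print(js_to_int32(2.0 ** 32 + 5))  # 5
print(js_to_int32(float("nan")))   # 0
```

The wrap-around cases are the awkward ones: an ordinary float-to-int conversion instruction typically saturates out-of-range values instead, so a JS engine needs extra checks and fix-up code around it.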


Sun Dec 06, 2020 4:14 pm
Profile

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
BigEd wrote:
(Similarly compiler related, I gather it's an advantage to the M1's emulation of x86 that modern x86 code tends to use a fairly regular subset of x86.)

Interesting that you brought that up. When I was looking to port a compiler to my M65C02A soft-core, I did a quick tally of the x86 instructions that the compiler used. As you suggest, for that Pascal compiler, the list shown below is remarkably short compared to the total number of instructions that the x86 processor itself supports.
Code:
    mov dst,src
    rep movsb
    lea dst,src
    cmp dst,src
    repe cmpsb
    push src
    pop dst
    not dst
    and dst,src
    or dst,src
    add dst,src
    sub dst,src
    imul src
    idiv src
    call dst
    ret n
    jmp dst
    jl dst
    jle dst
    je dst
    jne dst
    jg dst
    jge dst
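
A tally like this is easy to automate. The following is a minimal sketch, not the method used for the list above: it assumes an Intel-syntax assembly listing with `;` comments and labels ending in a colon, and it counts a `rep`-style prefix together with the instruction it modifies. The function name and the sample listing are hypothetical.

```python
from collections import Counter

# Prefixes counted together with the following mnemonic, e.g. "rep movsb".
PREFIXES = {"rep", "repe", "repne", "lock"}


def tally_mnemonics(listing: str) -> Counter:
    """Count how often each mnemonic appears in an assembly listing."""
    counts = Counter()
    for line in listing.splitlines():
        line = line.split(";", 1)[0].strip()   # drop comments and whitespace
        if not line or line.endswith(":"):     # skip blank lines and labels
            continue
        words = line.split()
        if words[0].lower() in PREFIXES and len(words) > 1:
            mnemonic = f"{words[0]} {words[1]}".lower()
        else:
            mnemonic = words[0].lower()
        counts[mnemonic] += 1
    return counts


sample = """
start:
    mov ax, bx
    mov cx, 10
    rep movsb
    add ax, cx      ; accumulate
    jne start
"""

for mnemonic, count in tally_mnemonics(sample).most_common():
    print(f"{count:3d}  {mnemonic}")
```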

_________________
Michael A.


Mon Dec 07, 2020 2:36 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Confused here. I have not followed Apple, but just what are we emulating that needs x86 code?


Mon Dec 07, 2020 10:15 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Apple is in the process of changing their consumer computers from x86 to ARM, and to provide some backward compatibility they have an ahead-of-time translation from x86 to ARM. (Also some just-in-time capability, I think.) So, with this technology, which they call Rosetta 2, users of the new computers can run older software which hasn't yet been ported to ARM. It turns out the performance isn't too bad, which is notable.


Mon Dec 07, 2020 10:40 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Apple proves what "cheap is" umm best CPU, works for computers.
I still remember Dr Dobbs and putting 512KB on a MAC. Ben.


Mon Dec 07, 2020 11:12 pm
Profile

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
There's little doubt that Apple will provide a smooth transition. In the past they already switched from 68000 to PowerPC and then to Intel. I recall the times when "Rosetta" executed 68000 code on PowerPC Macs and you didn't even know unless you looked at the system monitor. IIRC, the technology consists of real-time, just-in-time conversion of target machine code into native code, in blocks, as the execution progresses, in a way that is transparent to the user and is only performed a single time for any given piece of machine code. So it's not like a machine code interpreter but a true machine code translator, and this is why it is so fast. I also believe that translation from Intel instructions to ARM code is potentially a lot more efficient than 68000 to PowerPC, because ARM instructions and addressing modes are much more similar to Intel's than PowerPC's ever were to the 68000's. [Time to buy a "short" position in Intel stock...]
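
The translate-once, run-many idea can be shown with a toy. This is only a sketch under heavy assumptions and says nothing about how Apple's translators work internally: the guest "ISA" (`set`, `add`, `jmp`, `jnz`, `halt`), the `GUEST` program and the class name are all invented, and "native code" here just means a Python function generated and compiled once per guest block.

```python
# Hypothetical guest program: block address -> (operations, terminator).
GUEST = {
    0: ([("set", "n", 3), ("set", "acc", 0)], ("jmp", 1)),
    1: ([("add", "acc", 10), ("add", "n", -1)], ("jnz", "n", 1, 2)),
    2: ([], ("halt",)),
}


def translate_block(pc):
    """Translate one guest block into a host (Python) function, returning the next pc."""
    ops, term = GUEST[pc]
    lines = ["def native(s):"]
    for op, reg, val in ops:
        assign = "=" if op == "set" else "+="
        lines.append(f"    s[{reg!r}] {assign} {val}")
    if term[0] == "halt":
        lines.append("    return None")
    elif term[0] == "jmp":
        lines.append(f"    return {term[1]}")
    else:  # ("jnz", reg, taken, fallthrough): branch if the register is non-zero
        _, reg, taken, fallthrough = term
        lines.append(f"    return {taken} if s[{reg!r}] != 0 else {fallthrough}")
    namespace = {}
    exec("\n".join(lines), namespace)   # compile the generated host code once
    return namespace["native"]


class BlockTranslator:
    """Runs guest code block by block, translating each block only on first use."""

    def __init__(self, translate):
        self._translate = translate
        self._cache = {}         # guest pc -> translated host function
        self.translations = 0
        self.executions = 0

    def run(self, pc, state):
        while pc is not None:
            native = self._cache.get(pc)
            if native is None:               # first visit: translate and remember
                native = self._translate(pc)
                self._cache[pc] = native
                self.translations += 1
            self.executions += 1
            pc = native(state)               # later visits reuse the cached code
        return state


translator = BlockTranslator(translate_block)
final = translator.run(0, {})
print(final, translator.translations, translator.executions)
```

The loop block runs three times but is translated only once, which is the property that separates this from an interpreter that decodes every instruction on every pass.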


Mon Dec 07, 2020 11:14 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Nice graph in here of Apple's ARM experience: they've been making fully custom ARMs for 6 years now.
https://www.cs.utexas.edu/~bornholt/post/z3-iphone.html

Also a piece here (from 2018) about cache latencies over various generations:
https://www.anandtech.com/show/13392/th ... -secrets/3


Tue Dec 08, 2020 12:49 pm
Profile