Author |
Message |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
I read a couple of good deep descriptions of the machinery in Apple's new M1 chip, an ARM (or AArch64) implementation which seems to have great performance and insanely great performance per unit power. (I exaggerate.)

Apple's Humongous CPU Microarchitecture

A few quick takeaways:
- fast micros these days (since Alpha!) don't have single-cycle L1 cache
- M1 has 3 cycles for L1D, which is best in class
- M1's Firestorm A14 cores have a really wide decode stage, helped by fixed-length instructions
- there are just huge amounts of in-flight state

Apple's A14:
Attachment: Apple-A14-Firestorm-AnandTech.png

Another recent high-performance implementation:
Hot Chips 2020 Live Blog: IBM's POWER10 Processor on Samsung 7nm
Attachment: IBM-POWER10-AnandTech.png

Related previous threads:
|
Tue Nov 24, 2020 3:34 pm |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 675
|
IBM has a new standard for BCD math (2000?), for the new 360-style mainframe computers (I have no idea what they're called today), for things like COBOL, or taxes, or banking. I don't see that feature listed for modern CPUs. Is it patented just for IBM, hidden as some special upgrade feature, or are they so busy building an all-singing, all-dancing CPU that they have never thought about that feature?
|
Tue Nov 24, 2020 8:42 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
Quote: IBM has a new standard for BCD math (2000?), for the new 360-style mainframe computers (I have no idea what they're called today), for things like COBOL, or taxes, or banking. I don't see that feature listed for modern CPUs. Is it patented just for IBM, hidden as some special upgrade feature,

I wonder if it is considered part of the floating-point unit (decimal floating-point).

I think: yikes! On a branch miss, a lot of cycles would be wasted in the humongous architecture.
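If I have it right, the feature in question is IEEE 754-2008 decimal floating point, which the z-series and POWER machines implement in hardware while commodity CPUs leave it to software libraries. A quick Python illustration of why financial code wants decimal rather than binary arithmetic:

Code:
from decimal import Decimal

# Binary floating point cannot represent 0.10 exactly, which is why
# money code wants decimal arithmetic (hardware on IBM POWER and z,
# software libraries almost everywhere else):
print(0.10 + 0.20)                        # 0.30000000000000004
print(Decimal("0.10") + Decimal("0.20"))  # 0.30, exactly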
_________________Robert Finch http://www.finitron.ca
|
Tue Nov 24, 2020 9:01 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
According to this overview
https://threadreaderapp.com/thread/1331 ... 03104.html
Apple have added some mechanisms to help their case: a mode with strong memory ordering to help with x86 emulation; something which speeds up reference counting, which helps Swift programs; and something which specifically helps with JavaScript.
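The strong-ordering mode makes sense to me: x86 code is compiled assuming TSO (total store order), and emulating it on an ordinary weakly-ordered ARM core would need memory barriers all over the place. Here's a toy Python sketch of the classic message-passing litmus test to show the difference; the interleaving enumerator, and modelling reordering by statically swapping thread B's loads, are crude stand-ins for what real hardware does:

Code:
from itertools import combinations

# Thread A publishes data, then sets a flag; thread B reads the flag,
# then the data. x86 (TSO) preserves store-store and load-load order,
# so seeing flag == 1 guarantees data == 1. Plain AArch64 may reorder
# either pair, so a translator must insert barriers -- unless the core
# can run with TSO semantics, as the M1 reportedly can.
A = [("st", "data", 1), ("st", "flag", 1)]        # thread A: two stores
B = [("ld", "r1", "flag"), ("ld", "r2", "data")]  # thread B: two loads

def interleavings(a, b):
    """Every interleaving of two sequences that keeps each in order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ai, bi = iter(a), iter(b)
        yield [next(ai) if i in slots else next(bi) for i in range(n)]

def run(seq):
    mem, regs = {"data": 0, "flag": 0}, {}
    for ins in seq:
        if ins[0] == "st":
            mem[ins[1]] = ins[2]        # store: ("st", address, value)
        else:
            regs[ins[1]] = mem[ins[2]]  # load: ("ld", register, address)
    return regs["r1"], regs["r2"]

print({run(s) for s in interleavings(A, B)})                  # (1, 0) never occurs
print({run(s) for s in interleavings(A, list(reversed(B)))})  # (1, 0) now possible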
|
Thu Dec 03, 2020 6:18 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
The mysterious “something”. I would think custom instructions in the AArch64 set could have compatibility issues with future ARMs.
I wonder what the percentage improvement in processing speed is compared to a smaller four-wide machine, and what it costs in power consumption as well. For most apps it's lucky if two instructions execute at the same time. I would think an eight-wide machine would sit idle a lot of the time. With all those functional units, the bypass network must be pretty large. Fixed-length instructions probably really help the design; otherwise a lot of pipelining would be needed in the decode stage.
_________________Robert Finch http://www.finitron.ca
|
Fri Dec 04, 2020 5:43 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
> For most apps it's lucky if two instructions execute at the same time

It does feel like that, from a coding perspective. But seeing how many machines go to such expense to execute more, it can't be so! Perhaps we need a nice graphical simulation of an out-of-order machine to see which instructions get dispatched, which get stalled, and which get retired.
(It is possible that boring everyday code doesn't have much instruction-level parallelism to exploit, but highly optimised crucial routines do, and that's where performance really counts.)
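As a crude stand-in for such a simulation, here's a toy Python dispatch model (the register names and the sample stream are invented): each instruction issues in the first cycle in which its inputs are ready and an issue slot is free, and the result is an average instructions-per-cycle figure.

Code:
def schedule(stream, width=8):
    """Issue each instruction in the first cycle in which all of its
    inputs are ready and an issue slot is free; return average IPC."""
    ready = {}    # register -> cycle its value becomes available
    issued = {}   # cycle -> number of instructions issued that cycle
    for dest, srcs in stream:
        cycle = max([ready.get(r, 0) for r in srcs], default=0)
        while issued.get(cycle, 0) >= width:   # machine-width limit
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        ready[dest] = cycle + 1                # assume one-cycle latency
    return len(stream) / max(ready.values())

# Two independent three-instruction dependency chains (made-up registers):
stream = [("a", []), ("b", ["a"]), ("c", ["b"]),
          ("x", []), ("y", ["x"]), ("z", ["y"])]
print(schedule(stream, width=8))   # 2.0 -- the two chains overlap
print(schedule(stream, width=1))   # 1.0 -- a one-wide machine serializes them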
|
Fri Dec 04, 2020 10:34 am |
|
|
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
It also looks plausible to me that Apple are working heavily on their own fork of the LLVM compiler to take the most advantage of their new A14 processors. From the point of view of developers, it should only take a recompile of their apps to get the benefits.
|
Sat Dec 05, 2020 12:29 pm |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 675
|
It's hard to say how much faster it will be. Best-case timing is not average timing, and the memory cache affects the timing of the whole system. You need a balanced system so that everything gets a fair share of memory.
Faster is relative to the user. A better mouse-click response affects more people than a 5% increase in floating-point division speed.
|
Sun Dec 06, 2020 12:04 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
Good point that compiler improvements might bring further gains. (Similarly compiler-related, I gather it's an advantage to the M1's emulation of x86 that modern x86 code tends to use a fairly regular subset of x86.)

I found a really nice presentation on the limits to ILP: taking an ideal CPU, seeing how much instruction-level parallelism might ideally be extracted from an instruction stream, and then gradually refining the machine (and compiler) to more realistic conditions:
http://www.cse.uaa.alaska.edu/~afkjm/cs ... ations.pdf

Probably via this discussion, one of many about M1:
https://news.ycombinator.com/item?id=25257932
where we see

Quote: ...for a long time people were saying that "CISC is just compression for RISC, making virtue of necessity", but it seems like M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts...
and

Quote: I can't comment on the economics of it but I can comment on the technical difficulties. The issue for x86 cores is keeping the ROB fed with instructions - no point in building a huge OoO if you can't keep it fed with instructions. Keeping the ROB full falls on the engineering of the front-end, and here is where CISC v RISC plays a role. The variable length of x86 has implications beyond decode. The BTB design becomes simpler with a RISC ISA since a branch can only lie in certain chunks in a fetched instruction cache line in a RISC design (not so in CISC). RISC also makes other aspects of BPU design simpler - but I digress. Bottom line, Intel and AMD might not have a large ROB due to inherent differences in the front-end which prevent larger size ROBs from being fed with instructions.
See also Why do ARM chips have an instruction with Javascript in the name (FJCVTZS)?, which is probably the JS assist: one instruction replaces several, for an operation commonly needed by JS engines.
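As I read the documentation, FJCVTZS performs JavaScript's ToInt32 conversion in one go: truncate toward zero, wrap modulo 2^32, reinterpret as signed (the real instruction also sets a condition flag). A Python sketch of the semantics:

Code:
import math

def js_to_int32(x: float) -> int:
    """Model of what FJCVTZS computes: JavaScript's ToInt32. Without it
    an ARM JS engine needs a multi-instruction sequence, with branches
    for the out-of-range cases."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = int(x) & 0xFFFFFFFF              # int() truncates toward zero
    return n - 2**32 if n >= 2**31 else n

print(js_to_int32(3.7))         # 3
print(js_to_int32(-3.7))        # -3
print(js_to_int32(2**32 + 5))   # 5 -- wraps, as JS requires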
|
Sun Dec 06, 2020 4:14 pm |
|
|
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
BigEd wrote: (Similarly compiler-related, I gather it's an advantage to the M1's emulation of x86 that modern x86 code tends to use a fairly regular subset of x86.)

Interesting that you brought that up. When I was looking to port a compiler to my M65C02A soft-core, I did a quick tally of the x86 instructions that the compiler used. As you suggest, for that Pascal compiler, the list shown below is remarkably short compared to the total number of instructions that the x86 processor itself supports.

Code:
mov dst,src
rep movsb
lea dst,src
cmp dst,src
repe cmpsb
push src
pop dst
not dst
and dst,src
or dst,src
add dst,src
sub dst,src
imul src
idiv src
call dst
ret n
jmp dst
jl dst
jle dst
je dst
jne dst
jg dst
jge dst
_________________ Michael A.
|
Mon Dec 07, 2020 2:36 pm |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 675
|
Confused here. I have not followed Apple, but just what are we emulating that needs x86 code?
|
Mon Dec 07, 2020 10:15 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
Apple is in the process of changing their consumer computers from x86 to ARM, and to provide some backward compatibility they have an ahead-of-time translation from x86 to ARM. (Also some just-in-time capability, I think.) So, with this technology, which they call Rosetta 2, users of the new computers can run older software which hasn't yet been ported to ARM. It turns out the performance isn't too bad, which is notable.
|
Mon Dec 07, 2020 10:40 pm |
|
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 675
|
Apple proves what a "cheap", umm, "best" CPU can do for computers. I still remember Dr. Dobb's and putting 512KB in a Mac. Ben.
|
Mon Dec 07, 2020 11:12 pm |
|
|
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
There's little doubt that Apple will provide a smooth transition. In the past they already switched from the 68000 to PowerPC, and then to Intel. I recall the times when "Rosetta" executed 68000 code on PowerPC Macs and you didn't even know, unless you looked at the system monitor. IIRC, the technology consists of real-time, just-in-time conversion of target machine code into native code, in blocks, as execution progresses, in a way that is transparent to the user and is only performed a single time for any given piece of machine code. So it's not a machine-code interpreter but a true machine-code translator (a sketch below), and this is why it is so fast. I also believe that translation from Intel instructions to ARM code is potentially a lot more efficient than 68000 to PowerPC, because ARM instructions and addressing modes are much more similar to Intel's than PowerPC's ever were to the 68000's. [Time to buy a "short" position in Intel stock...]
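In outline, the scheme sounds something like this toy translate-once cache in Python; the block representation and the translate step here are invented placeholders, nothing like Rosetta's real internals:

Code:
translation_cache = {}   # guest block address -> translated native code

def translate(guest_block):
    """Stand-in for the expensive x86 -> ARM translation pass."""
    return [("native", op) for op in guest_block]

def execute(addr, guest_blocks):
    """Translate a block the first time it runs; reuse it ever after."""
    if addr not in translation_cache:
        translation_cache[addr] = translate(guest_blocks[addr])
    return translation_cache[addr]   # later visits are pure cache hits

The translation cost is paid once per block, so it amortizes away in any long-running program, which matches the "only performed a single time" behaviour described above.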
|
Mon Dec 07, 2020 11:14 pm |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
Nice graph in here of Apple's ARM experience: they've been making fully custom ARMs for six years now.
https://www.cs.utexas.edu/~bornholt/post/z3-iphone.html

Also a piece here (from 2018) about cache latencies over various generations:
https://www.anandtech.com/show/13392/th ... -secrets/3
|
Tue Dec 08, 2020 12:49 pm |
|