Many might recognize the Tremont name as Intel mentioned it several times, the first time in December of last year. Tremont are small cores packed with 3D packaging method called Foveros in Lakefield processors. Robinson did highlight that Lakefield found its way to very innovative Surface Neo products.
“Tremont is Intel’s most advanced low-power x86 architecture to date. We focused on a range of modern, complex workloads, while considering networking, client, browser and battery so that we could raise performance efficiently across the board. It is a world-class CPU architecture designed for enhanced processing power in compact, low-power packages.”
--Stephen Robinson, Intel Tremont Chief Architect
Once we saw the top-level design and the changes in the pipeline, it became clear that this is a big change compared to a previous low power architecture called Goldmont Plus released in December 2017. Goldmont Plus is the core inside of the Gemini Lake SoC.
Improvement in single-thread performance
Intel addressed Performance per mW, Performance per mm2, and added a new instruction in other to boost single and multi-threaded performance. Another important top-level design target included performance per mW and battery life.
The single-threaded performance got boosted by Intel Core class branch prediction, including six wide out of order instruction decode, four wide allocation, ten execution ports, Dual load/store pipelines. Tremont is a Quad-core module with L2 cache size modular between 1.5MB to 4.5MB for the four-core version. Robinson said that Intel expects that most customers will choose four cores version and larger sizes of cache.
Front end
The front-end Fetch and Predict comes with a long history, and it is 32 byte-based. The L1 predictor comes with (no penalty) and Large L2 predictor.
Since this is an out of order core, the fetch takes care of 32KB instruction cache, 32 bytes per cycle, and supports up to eight outstanding misses.
The front end Decode supports a 6-wide X86 instruction decode. It takes care of dual 3-wide clusters out of order. The wide decode without the area of a uop cache and optional single cluster mode based on product targets.
Integer supports large entry out of order window (>200). The exact number is 208. There are six Parallel reservation stations and seven wide execution (3 ALU, 2 AGU, one jump, and one store data).
The vector execution supports Dual 128b AES units, four-cycle crypto acceleration with single instruction SHA256, four-cycle and Galois Field new instructions.
There are two parallel reservation stations and three execution ports (SIMD/AES/FMUL, SIMD/AES/FADD, and Store data)
There are two load and store pipelines in the memory subsystem. The microarchitecture supports 32KB data cache with three cycle load to use and 1024 entry second-level TLB that is shared between code and data.
Shared L2 between big and small core
The L2 is shared among one to four cores, and it is configurable between 1.5MB to 4.5MB. The last level cache comes with inclusive and non-inclusive support. The memory subsystem supports Intel resource directory technology with L2 QoS, including code and data prioritization, LLC QoS, and Memory bandwidth enforcement.
It is important to mention that Sunny cove big cores, and Tremont small cores are plugging into same L2 cache. The L2 cache is designed to make the transitions from small to big cores smooth. The Lakefield is designed to offer reliable performance to consumers, cloud customers as well as embedded.
Instructions
The new instructions and technologies new to this 10nm focused microarchitecture include Accelerator interfacing Instructions with Efficient and scalable work-dispatch and synchronization to accelerators, Intel Speed Shift designed to Improve responsiveness with faster hardware controlled frequency changes. Additional two include CPU rooted secure boot with Intel Trusted Execution Technology and Intel Boot Guard and last but not least Intel total memory encryption designed to Improve confidentiality protection in memory from physical attacks.
Let me spend more time on Intel speed shift, Robinson mentions that this is a feature from Coffee and Kaby Lake. The feature will help hardware change the CPU clocks faster than the Operating system. The feature is designed to wake the system up from sleep faster. Ice Lake probably supports this, too, but Robinson didn’t have an answer for us, as he was focused on the Tremont design.
The silicon is out, and the microarchitecture will be used in different markets. Since Tremont supports some network-specific instructions there is a possibility to see it deployed in data center and 5G suitable workloads. Since this was not an SoC announcement Robinson didn’t want to comment if there will be a successor of Gemini Lake coming with these 10nm cores, but our friends at Wikichip have listed Tremont microarchitecture as a part of Skyhawk Lake in 2020.
Now the best comes at the end. The performance of Tremont should be 30 to 40 percent faster than the Goldmont plus cores. We were reminded that Goldmont plus ended up on average so much faster than the previous Goldmont core.
The relative single-threaded performance of Tremont measured in SPECint Rate Base 2006/2017 & SPECfp Rate Base 2006/2017 shows the range from 10+ to close to 80 percent faster performance relative to Goldmont plus cores.
Chief architect mentioned that Intel had waited to release small, big core architecture until it managed to get the right single and multi-threaded performance form the small cores. There is a foveros component of being able to package small and big cores on a small area, but this is something that exceeds the microarchitecture talks.
Microsoft might be the first announcing support for Tremont and Lakefield, but there will be more for sure. From where we stand, the Tremont looks very promising.