Xilinx Alveo U50 adaptable compute is out

by Fuad Abazovic on 06 August 2019

Wins against Xeon and Tesla

Xilinx has a data-center-first strategy that Victor Peng and his team have executed flawlessly so far, and part of that plan is to launch an adaptable compute, network and storage card called the Alveo U50.

The card sounds like a Swiss penknife that solves a lot of critical data-center-centric problems, and on top of software versatility, it also supports adaptable hardware thanks to its FPGA brain. Jamon Bowen, Data Center Compute and Storage Segment Marketing Director for Xilinx, spent some time talking about the U50 and its advantages over the competition.

The Alveo U50 is an industry-first low-profile adaptable accelerator with PCIe Gen 4 support, addressing a broad range of critical compute, network and storage workloads on one reconfigurable platform. The U50 is not yet Versal; it is still an UltraScale+ chip, but it does its job and can win against modern Intel data center CPUs as well as Nvidia Tesla cards.

Xilinx’s secret weapon is incredibly low latency, extreme power efficiency and the flexibility to deploy either on premises or in the cloud. Alveo U50 delivers between 10x and 20x improvements in throughput, latency, and power efficiency. For accelerated networking and storage workloads, the U50 card helps developers identify and eliminate latency and data movement bottlenecks by moving compute closer to the data.

PCIe Gen 4, HBM2, under 75W passive, half size

Built around the UltraScale+ chip, the Alveo U50 is the first such card in a half-height, half-length form factor with a low 75-watt power envelope. Typical power draw is lower than 75W, as this number represents the peak power for some extreme use cases. The U50 card packs 8GB of HBM2 with 460 GB/s of bandwidth and comes with a QSFP28 interface capable of 100 Gbit Ethernet as well as IEEE 1588 clock precision. The Alveo U50 is a PCIe Gen 3 x16 or dual PCIe Gen 4 x8 card that supports CCIX. The 75W TDP is achieved with a passive cooler.

The high-speed I/O supports advanced applications such as NVM Express over Fabrics (NVMe-oF), disaggregated computational storage and specialized financial services applications.

Machine learning inference, video transcoding, data analytics, computational storage, electronic trading and financial risk modeling are some of the tasks where the Alveo U50 brings programmability, flexibility, high throughput and low latency to any server deployment.

The hardware-programmable part lets customers optimize applications as workloads and algorithms continue to evolve.

Below are some examples where Alveo U50 accelerated solutions deliver significant customer value across a range of applications:

Deep learning inference acceleration (speech translation): delivers up to 25x lower latency, 10x higher throughput and significantly improved power efficiency per node compared to GPU-only speech translation;
Data analytics acceleration (database query): running the TPC-H Query benchmark, Alveo U50 delivers 4x higher throughput per hour and reduces operational costs by 3x compared to in-memory CPU;
Computational storage acceleration (compression): delivers 20x more compression/decompression throughput, faster Hadoop and big data analytics, and over 30 percent lower cost per node compared to CPU-only nodes;
Network acceleration (electronic trading): delivers 20x lower latency and sub-500ns trading time compared to CPU-only latency of 10us;
Financial modeling (grid computing): running the Monte Carlo simulation, Alveo U50 delivers 7x greater power efficiency compared to GPU-only performance for a faster time to insight, deterministic latency, and reduced operational costs.

Chatting with Jamon Bowen, we were quite surprised to learn that in deep learning inference acceleration (speech translation) the U50 can beat the Nvidia Tesla T4, scoring 25 times lower latency and ten times higher throughput. Below is an example of high-throughput, low-latency inference acceleration.

10X faster speech translation than Tesla T4

Alveo runs two batches while Nvidia has to run eight batches of data, which adds significantly to the overall latency. The score measured in Transformer NMT (symbols/second) was ten times higher for Xilinx’s U50 than for the Nvidia Tesla T4 GPU.

4.5X faster database analytics than Xeon Platinum 8260

Another example is high-throughput query acceleration, measured as a speedup in queries per hour. The Intel Xeon Platinum 8260 processor (35.75MB cache, 2.40 GHz, 24 cores) was the 1X baseline. One Alveo U50 card was 4.5 times faster. The Intel Xeon Platinum 8260 CPU had a query time of 210ms, scoring 34k queries per hour. The Xilinx Alveo U50 has a latency of 24ms and managed to score 150k queries per hour.
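As a rough sanity check, the short Python sketch below recomputes the ratios from the figures quoted above. The variable names are illustrative, and the inputs are the rounded values from the article, so the results are only approximate.

```python
# Rounded figures quoted in the article for the query benchmark.
XEON_QUERIES_PER_HOUR = 34_000
U50_QUERIES_PER_HOUR = 150_000
XEON_QUERY_TIME_MS = 210
U50_QUERY_TIME_MS = 24

# Throughput speedup: how many times more queries per hour the U50 completes.
throughput_speedup = U50_QUERIES_PER_HOUR / XEON_QUERIES_PER_HOUR

# Latency ratio: how many times shorter a single query is on the U50.
latency_ratio = XEON_QUERY_TIME_MS / U50_QUERY_TIME_MS

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"latency ratio: {latency_ratio:.2f}x")
```

With these rounded inputs the throughput ratio comes out at roughly 4.4x, in line with the quoted 4.5x, while the per-query time is 8.75 times shorter.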

The best part is that Alveo U50 performance scales linearly: adding a second Alveo U50 card would make the Xilinx setup nine times faster than the Xeon Platinum 8260, and three Alveo U50 cards would be thirteen times faster than the CPU. To spice things up, the Xeon Platinum has a 165W TDP, as much as two Alveo U50 cards and change, while those two cards score nine times faster.
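The scaling and power claims can be put into numbers with a small Python sketch. It assumes perfectly linear scaling from the quoted 4.5x single-card result and compares the CPU's 165W TDP against the card's 75W peak figure, which is a crude comparison rather than a measurement. The function names are made up for illustration.

```python
SINGLE_CARD_SPEEDUP = 4.5  # quoted speedup of one U50 over the Xeon baseline
XEON_TDP_W = 165           # quoted Xeon Platinum 8260 TDP
U50_PEAK_POWER_W = 75      # quoted U50 peak power envelope


def scaled_speedup(cards: int) -> float:
    """Speedup over the CPU baseline, assuming perfectly linear scaling."""
    return SINGLE_CARD_SPEEDUP * cards


def perf_per_watt_advantage(cards: int) -> float:
    """Speedup per watt of the cards relative to the CPU (1x at its TDP)."""
    return scaled_speedup(cards) * XEON_TDP_W / (cards * U50_PEAK_POWER_W)


for n in (1, 2, 3):
    print(n, scaled_speedup(n), round(perf_per_watt_advantage(n), 1))
```

Under strict linearity three cards would reach 13.5x, which the article rounds to thirteen. Because both performance and power grow with the number of cards, the rough performance-per-watt advantage stays at about 9.9x regardless of card count.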

Financial market modeling 20X faster than Intel, 7X faster than Tesla

Financial market modeling is a good example, where big banks and financial institutions use Monte Carlo simulation for hyper-efficient derivative pricing and risk modeling. A shorter time to insight is extremely important for the financial market. Alveo scores 20X more paths per second per watt compared to the Intel Xeon E5-2697 v4 (GCC 5.4.0). The Nvidia Tesla V100 16GB PCIe (CUDA 10.1 / GCC 5.4.0) scores 3X better than the CPU, but still seven times worse than the Xilinx Alveo U50 using SDAccel 2018.3.
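The "seven times" figure follows from the two CPU-relative numbers. A minimal Python sketch of that arithmetic, using only the rounded multipliers quoted above (the variable names are illustrative):

```python
U50_VS_CPU = 20.0   # quoted: U50 paths/second/watt relative to the Xeon
V100_VS_CPU = 3.0   # quoted: Tesla V100 relative to the same Xeon

# Dividing the two CPU-relative scores gives the U50's advantage over the V100.
u50_vs_v100 = U50_VS_CPU / V100_VS_CPU

print(f"U50 vs V100: {u50_vs_v100:.1f}x")
```

The result is about 6.7x, which matches the roughly 7X advantage quoted for the U50 over the Tesla V100.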

20X faster trading, low latency

Ultra-low-latency networking is something that electronic trading operations benefit from, and the Alveo U50 managed to achieve a 20 times latency reduction over the CPU-based solution. The overall T2T latency was less than 500 ns with deterministic performance. The CPU-based solution needs 10 microseconds to go from the same market data to a TCP message.
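The 20X figure is simply the ratio of the two latencies once both are expressed in the same unit. A small Python sketch of the conversion, using the quoted 10 microseconds for the CPU and the sub-500-nanosecond bound for the U50 (the names are illustrative):

```python
CPU_LATENCY_US = 10    # quoted CPU-based latency, in microseconds
U50_LATENCY_NS = 500   # quoted upper bound for the U50, in nanoseconds

NS_PER_US = 1_000

# Convert the CPU figure to nanoseconds before comparing.
cpu_latency_ns = CPU_LATENCY_US * NS_PER_US
latency_reduction = cpu_latency_ns / U50_LATENCY_NS

print(f"latency reduction: at least {latency_reduction:.0f}x")
```

Since the U50 figure is an upper bound, 20x is the minimum reduction implied by these numbers.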

20X faster GZIP compression/decompression per core

Compared to an Intel Skylake-SP 6152 CPU at 2.10GHz running Ubuntu 16.04, the Alveo U50 can offer a 20X improvement in compression/decompression. The test shows results based on GB/s of compression per core, while Alveo U50 performance is estimated at a stable 10GB/s at 2:1 compression.

“Ever-growing demands on the data center are pushing existing infrastructure to its limit, driving the need for adaptable solutions that can optimize performance across a broad range of workloads and extend the lifecycle of existing infrastructure, ultimately reducing TCO”, said Salil Raje, executive vice president and general manager, Data Center Group, at Xilinx. “The new Alveo U50 brings an optimized form factor and unprecedented performance and adaptability to data center workloads, and we continue to build out solution stacks with a growing ecosystem of application partners to deliver previously unthinkable capabilities to a range of industries.”

AMD is excited about this product, as it is the only CPU player in town supporting PCIe Gen 4 on the EPYC side, and IBM seems to be very interested in the Alveo U50.

The Alveo U50 is sampling now, with OEM system qualifications in progress. General availability is slated for fall 2019.

Last modified on 07 August 2019
