By Paul Sutton at the Xcelerit Blog
FPGAs are programmable hardware devices, traditionally used in the signal processing domain for real-time number-crunching where high performance and low power consumption are paramount. For financial services, their deterministic high performance and low latency make FPGAs a perfect fit for high-frequency trading, which is where they are typically used in banks and hedge funds. However, the complexity of developing in VHDL or Verilog has been a major barrier to adoption in derivatives pricing and risk management applications, i.e. compute-intensive analytics. In this blog post, we look into using FPGAs for this type of algorithm, using the OpenCL-enabled PCIe-385N FPGA board from Nallatech that we’ve just received. It features the powerful Altera Stratix V A7 FPGA. We’ve put it to the test using the example of a complex derivative pricing algorithm.
Nallatech PCIe-385N
This board comes in a PCI Express form factor that plugs easily into workstations. It is small (low-profile, half-length PCIe) and low power (tens of watts). It can be configured with 8 or 16GB of memory, leaving enough headroom for most financial applications. The board supports Altera’s SDK for OpenCL, allowing it to be programmed using higher-level software tools. This SDK automatically compiles and synthesizes the OpenCL kernel code into FPGA logic, creating deep parallel pipelines and adding the interfacing logic needed to control execution from the host CPU.
Nallatech PCIe-385N FPGA Card with Altera Stratix V FPGA
Algorithm
For testing, we’ve used a Monte-Carlo LIBOR swaption portfolio pricing algorithm. It prices a portfolio of 15 swaptions on the LIBOR rate, using thousands of Monte-Carlo paths. In each path, a potential future development of the LIBOR interest rate is simulated at 80 time steps, employing a LIBOR market model and using normally-distributed random numbers. For each path, the value of the swaption portfolio is computed by applying a portfolio pay-off function. The overall value of the portfolio is then estimated by computing the mean across all paths. The equations for computing the LIBOR rates and pay-offs are given in Prof. Mike Giles’ notes (Oxford University). The algorithm is depicted by the dataflow graph below:
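The steps described above can also be sketched in a few lines of Python. This is a simplified stand-in, not the LIBOR market model from Prof. Giles’ notes: it evolves a single hypothetical rate with a driftless lognormal step and applies one caplet-style pay-off rather than pricing 15 swaptions, and it ignores discounting. The function names, initial rate, volatility, step size and strike are all illustrative assumptions. Only the 80 time steps and the overall pipeline (normal random numbers, path simulation, pay-off, mean across paths) come from the description above.

```python
import math
import random

NUM_STEPS = 80  # time steps per path, as stated in the text


def simulate_path(rng, rate0=0.05, vol=0.2, dt=0.25):
    """Evolve one hypothetical rate over NUM_STEPS time steps.

    Placeholder dynamics (driftless lognormal), NOT the LIBOR market model.
    All default parameter values are assumptions for illustration.
    """
    rate = rate0
    for _ in range(NUM_STEPS):
        z = rng.gauss(0.0, 1.0)  # normally distributed random number
        rate *= math.exp(-0.5 * vol * vol * dt + vol * math.sqrt(dt) * z)
    return rate


def payoff(final_rate, strike=0.05):
    """Illustrative pay-off: the positive part of (rate - strike)."""
    return max(final_rate - strike, 0.0)


def price(num_paths, seed=42):
    """Estimate the value as the mean pay-off across all simulated paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_paths):
        total += payoff(simulate_path(rng))
    return total / num_paths


if __name__ == "__main__":
    print(price(4096))
```

Each path is independent of the others until the final mean, which is what makes this kind of algorithm a natural candidate for deep parallel pipelines.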
Test Setup
We’ve run the described algorithm on the Nallatech card with the core algorithm implemented entirely on the FPGA. That is, the random number generation, path computation, and mean reduction all run in FPGA logic. The overall application is directed from software running on the host CPU. The FPGA uses single-precision floating point in all computations.
The following test system was used:
- CPU: 2 Intel Xeon E5620 processors, 4 cores each
- Accelerator: Nallatech PCIe-385N with Stratix V A7 FPGA
- OS: Red Hat Enterprise Linux 5.4 (64-bit)
- RAM: 24GB
- FPGA Design Suite: Altera Quartus II 12.1 SP1, 64-bit
- OpenCL SDK: Altera SDK for OpenCL version 12.1 beta
- Host Compiler: GCC 4.1
The FPGA resource utilisation is as follows:
Resource | Usage |
Logic Utilisation | 54% |
ALUTs Used | 27% |
Dedicated Logic Registers | 26% |
Memory Blocks | 91% |
DSP Blocks | 57% |
As can be seen, the design is dominated by memory blocks (used for storing temporary arrays and caching data), followed by DSP blocks (which perform the floating point calculations).
Performance
Note: These performance numbers are indicative only, as they have been recorded with a beta version of the first OpenCL SDK from Altera. The purpose of this blog post is to show that FPGAs have evolved from a specialist hardware domain into a platform that can be smoothly handled by a software developer without FPGA experience.
We’ve measured the computation times of the FPGA version and compared them with a sequential CPU reference of the same code. The speedup factors of the FPGA-based computation over the sequential CPU reference take the full algorithm execution time into account. They are illustrated in the graph below for varying numbers of paths, and in the table that follows.
Paths | 4K | 16K | 64K | 256K | 1024K |
Speedup | 35.0x | 29.3x | 27.9x | 27.6x | 27.6x |
Discussion
These numbers clearly show that the FPGA delivers high performance: up to 35x faster than the sequential CPU implementation. This is even more impressive given that it consumes an order of magnitude less power than a server-grade CPU. The deterministic performance of FPGAs can also be seen in the speedups above. The execution time per Monte-Carlo path is almost exactly constant on the FPGA, while on the CPU it improves as the number of paths grows. This is why the speedup curve in the figure above decreases slightly.
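The relationship between the speedups and the per-path times can be checked with a little arithmetic. If the FPGA time per path is treated as exactly constant (the text says "almost exactly", so this is an approximation), the speedup is proportional to the CPU time per path, and the ratio of two speedups equals the ratio of the corresponding CPU per-path times. The sketch below uses only the figures from the speedup table; the variable and function names are illustrative.

```python
# Speedup factors from the table, keyed by the number of paths.
SPEEDUPS = {"4K": 35.0, "16K": 29.3, "64K": 27.9, "256K": 27.6, "1024K": 27.6}


def relative_cpu_time_per_path(speedups, baseline="4K"):
    """CPU time per path, relative to the baseline run.

    Assumes the FPGA time per path is constant, so that
    speedup(N) / speedup(baseline) = cpu_per_path(N) / cpu_per_path(baseline).
    """
    base = speedups[baseline]
    return {paths: s / base for paths, s in speedups.items()}


if __name__ == "__main__":
    for paths, ratio in relative_cpu_time_per_path(SPEEDUPS).items():
        print(f"{paths:>6}: {ratio:.2f}")
```

Under that assumption, the CPU’s time per path at 256K paths is about 79% of its time per path at 4K paths (27.6 / 35.0), which accounts for the drop in speedup from 35.0x to 27.6x.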
It should be noted that synthesising hardware designs, even when they are generated automatically, is a much more complex process than compiling a piece of conventional software: synthesis for the above example took nearly 4 hours. The cycle of fixing some code and testing it is therefore much more involved. If you are interested in learning more about efficient software development techniques for FPGAs, just drop us a line and we’ll get back to you with more details.
The opinions and writing contained in this article are of the author alone and do not necessarily represent those of HFTReview.com.