Fixed vs. Floating Point
How to ease the pain of floating-point to fixed-point conversion when porting an algorithm to a real-time embedded FPGA or ASIC hardware accelerator.
The Fixed Point / Floating Point Conflict
Scientific algorithms are typically developed in high-level languages such as MATLAB or C. Single- or double-precision floating-point data types are used to support a wide dynamic range while maintaining precision. Indeed, issues of truncation, rounding, and overflow can seem frivolous at a stage when the validity of the algorithm is the utmost concern.
However, full floating-point arithmetic is often unsuitable for implementation in an FPGA or ASIC. Compared with integer or fixed-point operators, floating-point operators generally consume far more logic and power, run at lower clock speeds, require longer pipelines, and deliver lower throughput. Instead of just throwing more or larger logic devices at the problem, companies are usually faced with the non-trivial task of converting all or some of the algorithm to fixed point.
This article sets out the issues that cause the conflict and ties together the pertinent Dillon Engineering solutions.
Disconnect between disciplines
Depending on the company, a chasm of varying width may exist between the scientists, DSP gurus, system engineers, and algorithm developers on one side and the hardware, embedded software, and implementation engineers on the other. Even if the algorithm developers have fixed-point capability in their software, or the embedded developers are proficient in high-level languages, there may not be a good link between the tools or between the engineering disciplines.
The result can be poorly documented, ad hoc, minimally verified implementations that run the risk of hard-to-identify lab or field failures. Clearly, what is needed is a process that follows the design all the way from early analyses to the eventual hardware implementation.
Working with requirements and constraints
When working toward the optimal hardware solution, there are two factors pitted against each other:
- Algorithm adherence to functional requirements
- Hardware adherence to design constraints
From the top down, the algorithm must still achieve acceptable results, no matter what data restrictions or restructurings are imposed. From the bottom up, the hardware must be realizable considering the following factors:
- Physical constraints, such as size, weight, power, and design margins
- Performance requirements, such as real-time or line-rate throughput
- Recurring hardware and non-recurring development cost
The interactions of these 2nd- and 3rd-order tradeoffs are where the analyses begin to get interesting.
Analyses for floating point to fixed point conversion (FFC)
The initial analysis of the hardware implementation will likely be based on full floating point, and will then move on to FFC as a way to meet or optimize against the hardware requirements. Perhaps the algorithm port can remain entirely in floating point, take an acceptable effort to implement, and satisfy all of the functional and non-functional requirements. Even if this is the case, some FFC investigation may be beneficial because power, logic size, and recurring cost could be reduced.
The following are example stages of analysis typically encountered when considering FFC.
Hardware utilization predictions
To predict floating- and fixed-point logic utilization, one must be able to dissect the algorithm into building blocks that can be characterized in terms of the hardware constraints. At Dillon Engineering (DE), we are able to quickly arrive at a base architecture using a literal sequential flow-through analysis of the original algorithm, then use data from our IP Core Library and existing design database to identify the building blocks. For custom IP, our ParaCore Architect core auto-generation capabilities allow us to quickly characterize new variants. The result is more accurate utilization predictions for the base and subsequent architectures.
Hardware performance predictions
Predicting hardware performance in algorithm logic accelerators goes beyond simple clock frequency parameters. For understanding fixed- and floating-point pipelined building block performance, it is necessary to determine from the algorithm where potential bottlenecks may exist in hardware. Design experience goes a long way here, using tools such as spreadsheets and flow diagrams to analyze throughputs at microscopic and macroscopic levels.
To take advantage of logic device operation and avoid bottlenecks, straight-through pipelined concurrent processing may not be enough. Parallelization and resource-sharing modifications should be considered to improve performance, reduce device utilization, or both. Deep design knowledge of the utilization and performance predictions discussed above is required to understand the structure and impact of such optimizations.
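As a rough illustration of the kind of spreadsheet-style throughput analysis described above, the following Python sketch finds the slowest stage in a pipeline and estimates how many parallel copies of that stage would be needed to reach a target rate. The stage names, cycle counts, and clock frequency are hypothetical placeholders chosen for illustration, not figures from any real design.

```python
import math

def stage_rate(clock_hz, cycles_per_sample, copies=1):
    """Samples per second a stage can sustain with `copies` parallel instances."""
    return copies * clock_hz / cycles_per_sample

def bottleneck(stages, clock_hz):
    """Return (name, rate) of the slowest stage.

    `stages` maps a stage name to (cycles_per_sample, copies).
    """
    rates = {
        name: stage_rate(clock_hz, cycles, copies)
        for name, (cycles, copies) in stages.items()
    }
    slowest = min(rates, key=rates.get)
    return slowest, rates[slowest]

def copies_needed(clock_hz, cycles_per_sample, required_rate):
    """Parallel instances needed for a stage to sustain `required_rate`."""
    return math.ceil(required_rate * cycles_per_sample / clock_hz)

if __name__ == "__main__":
    clock_hz = 100_000_000  # assumed 100 MHz clock (illustrative)
    # Hypothetical stages: (cycles per sample, parallel copies)
    stages = {"window": (1, 1), "fft": (1, 1), "magnitude": (4, 1)}
    name, rate = bottleneck(stages, clock_hz)
    print("bottleneck:", name, "at", rate, "samples/s")
    print("copies to reach line rate:", copies_needed(clock_hz, 4, clock_hz))
```

A model like this only captures steady-state rates. Latency, memory bandwidth, and resource sharing between stages would need to be added before it could stand in for a real analysis.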
Fixed point insertion
Once the need for some FFC has been identified, the question becomes where it makes the most sense to convert. Keeping data that is natively in fixed point, such as samples straight off an ADC, is a logical starting point. Larger functions like FFTs generally show correspondingly large differences in logic utilization between fixed and floating point. Once again, a deep knowledge base of the low-level structures streamlines the trade-off process. But no matter where FFC seems to provide the greatest hardware utilization relief, the top-down issue of maintaining algorithm functional requirements now comes into play.
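One reason natively fixed-point data is a natural starting point is that integer samples can pass through multiply-accumulate logic with no loss at all, provided the accumulator is wide enough. The sketch below applies the usual worst-case bound (data width plus coefficient width plus the base-2 log of the number of terms, rounded up). The 12-bit ADC, 16-bit coefficients, and 8 terms are assumed values for illustration only.

```python
import math

def accumulator_bits(data_bits, coef_bits, num_terms):
    """Worst-case signed width for a sum of `num_terms` signed products."""
    return data_bits + coef_bits + math.ceil(math.log2(num_terms))

def fits_signed(value, bits):
    """True if `value` is representable as a `bits`-wide two's-complement integer."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

if __name__ == "__main__":
    data_bits, coef_bits, terms = 12, 16, 8   # assumed widths (illustrative)
    width = accumulator_bits(data_bits, coef_bits, terms)

    # Worst case: every sample and every coefficient at the most negative code.
    sample = -(1 << (data_bits - 1))
    coef = -(1 << (coef_bits - 1))
    worst_sum = terms * sample * coef

    print("accumulator width:", width)
    print("worst-case sum fits:", fits_signed(worst_sum, width))
```

In practice the full-width result is usually rounded or truncated back to a narrower word after the accumulation, and that is where the data quality questions begin.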
Fixed point data quality
After the hotspots with potential FFC benefits have been identified, it must be determined whether the algorithm intent will still be met. Trying to do this by hand analysis or HDL simulation can be a long and arduous process. For complex algorithms, there may be hundreds of potential FFC opportunities. This is where full floating- and fixed-point math modeling reaps huge benefits. Having a model that is easy to tweak, fast to simulate, and bit-true with the logic is perhaps the most important productivity gain in the FFC process.
No matter what FFC is implemented, if a weak data set is used for stimulus or the results are not properly analyzed, the algorithm functional requirements will never be proven. Once again, a high-level language environment makes it quick and easy to inject typical and worst-case stimulus using built-ins such as trig functions and file I/O, and to compare results using common metrics such as signal-to-noise ratio. The more stress testing is done in a high-level language, the more productivity is gained in FFC applications.
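To make the idea of a tweakable fixed-point model concrete, here is a minimal Python sketch of a quantizer with selectable rounding and overflow behavior. The function names, the mode names, and the choice of round-half-up are assumptions made for this example. A genuinely bit-true model has to match whatever the target logic actually does.

```python
import math

def to_fixed(x, word_bits, frac_bits, rounding="nearest", overflow="saturate"):
    """Quantize float `x` to a signed two's-complement integer code.

    word_bits: total width including sign; frac_bits: fractional bits.
    """
    scaled = x * (1 << frac_bits)
    if rounding == "nearest":
        code = math.floor(scaled + 0.5)   # round half up
    elif rounding == "truncate":
        code = math.floor(scaled)         # drop LSBs (toward minus infinity)
    else:
        raise ValueError("unknown rounding mode: " + rounding)

    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    if overflow == "saturate":
        code = max(lo, min(hi, code))     # clamp to the representable range
    elif overflow == "wrap":
        code = ((code - lo) % (1 << word_bits)) + lo  # two's-complement wrap
    else:
        raise ValueError("unknown overflow mode: " + overflow)
    return code

def to_float(code, frac_bits):
    """Convert an integer code back to its real value."""
    return code / (1 << frac_bits)

if __name__ == "__main__":
    # 16-bit word with 15 fractional bits (often written Q1.15)
    for value in (0.5, -0.3, 1.0):
        sat = to_fixed(value, 16, 15)
        wrapped = to_fixed(value, 16, 15, overflow="wrap")
        print(value, sat, to_float(sat, 15), wrapped)
```

Changing the word length or a mode is a one-argument edit here, which is the kind of fast iteration that is hard to get from HDL simulation alone.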
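The stimulus-and-metric loop described above can be sketched in a few lines of Python. This example drives a sine-wave stimulus through a simple rounding quantizer at several fractional word lengths and reports the signal-to-noise ratio against the floating-point reference. The amplitude, frequency, sample count, and word lengths are arbitrary assumptions, and a single sine is only one of many stimuli a real test plan would include.

```python
import math

def quantize(x, frac_bits):
    """Round `x` to the nearest multiple of 2**-frac_bits (no overflow handling)."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale

def snr_db(reference, test):
    """Signal-to-noise ratio in dB, treating (reference - test) as noise."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)

def sweep(frac_bits_list, num_samples=4096, amplitude=0.9, cycles_per_sample=0.0123):
    """Return {frac_bits: SNR in dB} for a sine-wave stimulus."""
    stimulus = [
        amplitude * math.sin(2 * math.pi * cycles_per_sample * k)
        for k in range(num_samples)
    ]
    return {
        bits: snr_db(stimulus, [quantize(s, bits) for s in stimulus])
        for bits in frac_bits_list
    }

if __name__ == "__main__":
    for bits, snr in sweep([7, 11, 15]).items():
        print(f"{bits} fractional bits: {snr:.1f} dB")
```

The same `snr_db` comparison could be applied to the output of a full algorithm model rather than a bare quantizer, with the pass threshold taken from the functional requirements.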
Putting it together with the capabilities developed at Dillon Engineering
Unless the problem is especially simple or obvious, the FFC process will likely be an iterative one. As such, reducing the time and effort of each pass through the loop will pay compounded dividends. Combine this with the need for confidence that the hardware accelerator will meet every functional and non-functional requirement for its service life, and it is clear that successful FFC requires all pieces of the puzzle to work together. For more details on how DE addresses each of these topics, see the following pages: