Flash memory has expanded rapidly from simple portable "thumb" drives to low-power, high-performance enterprise servers, and the market is still evolving.
The unique capabilities of Tensilica processors have proven ideal for this market, where there are few standards yet plenty of innovation aimed at solving the write amplification, energy and error correction problems while increasing IOPS.
Our licensees are shipping Xtensa-based products that lead the industry. With more efficient processing logic and data I/O for their particular product or product line, they can increase IOPS with fewer gates and consume less energy. No other processor can offer this.
Where can Tensilica processors be used in your flash controller?
Encryption & Hashing
- AES-XTS up to 265x faster for 35kGates
- Triple-DES up to 50x faster for 5kGates
- SHA-1 up to 12x faster for 33kGates ... compared to general purpose processors.
Cyclic Redundancy Check (CRC)
- Up to 12x faster at 8 bits per cycle for 3kGates
- Up to 24x faster at 16 bits per cycle for 3kGates ... compared to general purpose processors.
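The cycle-count gain from processing more bits per step can be sketched in software. Below is a minimal Python model (not Tensilica's implementation) comparing a bit-serial CRC-16-CCITT with a table-driven version that consumes a whole byte per step, analogous to a custom instruction retiring 8 bits per cycle:

```python
def crc16_bitwise(data: bytes, poly: int = 0x1021, crc: int = 0xFFFF) -> int:
    """Bit-serial CRC-16-CCITT: one bit per iteration, 8 steps per byte."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

# A precomputed 256-entry table lets one step consume a whole byte -
# the software analogue of hardware that retires 8 bits per cycle.
_TABLE = []
for i in range(256):
    c = i << 8
    for _ in range(8):
        c = ((c << 1) ^ 0x1021) & 0xFFFF if c & 0x8000 else (c << 1) & 0xFFFF
    _TABLE.append(c)

def crc16_bytewise(data: bytes, crc: int = 0xFFFF) -> int:
    """Byte-at-a-time CRC-16-CCITT using the precomputed table."""
    for byte in data:
        crc = ((crc << 8) & 0xFFFF) ^ _TABLE[((crc >> 8) ^ byte) & 0xFF]
    return crc
```

Both variants produce the standard CRC-16/CCITT-FALSE check value 0x29B1 for the string "123456789"; a 16-bits-per-cycle hardware implementation extends the same idea with a wider table or XOR network.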
Lempel Ziv Compression
- ~5.5x faster for <17kGates ... compared to general purpose processors.
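To make the workload concrete, here is a toy LZ77-style match finder in Python. It is purely illustrative - the accelerated algorithm and its parameters are not specified here - but its inner byte-comparison loop is exactly the kind of hot spot that custom instructions target:

```python
def lz77_compress(data: bytes, window: int = 255, max_len: int = 15):
    """Greedy LZ77 sketch: emit literals or (offset, length) back-references."""
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            # Inner matching loop: the byte-compare hot spot.
            l = 0
            while l < max_len and i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_len, best_off = l, i - j
        if best_len >= 3:                 # only encode matches worth a token
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i])           # literal byte
            i += 1
    return out
```

For example, `lz77_compress(b"abcabcabcabc")` emits three literals followed by a single back-reference `(3, 9)` covering the repeated pattern.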
LDPC Error Correction
- Software programmable for flexibility
- Shorter development and easier maintenance
- Similar in size to RTL
- Customer algorithms can be accelerated
- No-one else will have the same acceleration
Host Protocol processing
- Multiple protocol support
- Single cycle per header
- Any width up to 1024 bits
- Initiate Data DMA
- More processing is available if needed
- 7x faster for <1kGates ... compared to general purpose processors.
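As a software analogy, single-cycle header decode amounts to treating the whole header as one wide value and extracting every field at once. The sketch below uses a hypothetical 16-byte command layout - the field names and sizes are assumptions for illustration, not any real protocol's:

```python
import struct

# Hypothetical 16-byte command header: opcode (1B), flags (1B),
# command id (2B), namespace id (4B), data length (4B), LBA (4B).
HEADER = struct.Struct("<BBHIII")

def parse_header(raw: bytes) -> dict:
    """Decode all fields of the (assumed) header layout in one unpack,
    mirroring a wide-port instruction that decodes a header per cycle."""
    opcode, flags, cid, nsid, length, lba = HEADER.unpack(raw)
    return {"opcode": opcode, "flags": flags, "cid": cid,
            "nsid": nsid, "length": length, "lba": lba}
```

In hardware, a 128-bit (or wider) port delivers the whole header in one transfer, so the decode really can complete in a single cycle rather than a loop of loads and shifts.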
Trends in SSD Controller Design
Flash controller designs are always changing in response to:
- Increasing data throughput at the host interface
- Larger memory capacity at the flash interface
- Increasing ECC requirements from smaller geometries and multi-level-cells
- Changing/developing interface protocols
- Better proprietary algorithms for more competitive products
Dedicated hardware has traditionally been used in SSD controller designs for high-throughput tasks such as error correction and host/flash interface control, while a generic control CPU deals with housekeeping requirements such as block mapping, wear leveling, garbage collection and the differentiating operations.
As performance requirements increase with each new generation of controller design, one conventional processor cannot keep up with the I/O bandwidth and new algorithm complexities. Initially, a design team might increase the clock speed of the CPU. Eventually, once raising the clock speed becomes prohibitively costly in terms of die area and energy consumption, more processors are added.
Adding more processors to the design adds intercommunication complexity and, because conventional controller CPUs are typically not very efficient at implementing SSD-specific algorithms, energy efficiency gets even worse. As a result, even more dedicated logic is added to offload some of the processor's most inefficient functions, such as table lookups and linked-list searching.
Maintaining and upgrading the increasing amounts of dedicated logic is a large task due to the verification time required. Keeping these hardware accelerator blocks as programmable as possible reduces development time and risk, but the energy and die area limits restrict the ability to just add bigger or more processors.
One increasingly popular solution to the dilemma: Xtensa processors allow the designer to avoid creating and verifying dedicated offload logic. Instead, the designer quickly captures the computational workload of the accelerator as a set of new instructions inside a Tensilica DPU, avoiding the additional interfaces, state machines and long verification times. The result is also programmable and easy to change.
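As a rough software model of the idea: an operation that takes several RISC instructions (shift, mask, table load) can be captured as one logical instruction. The Python below models a hypothetical fused lookup - the field layout and mapping table are invented for illustration:

```python
# Toy logical-to-physical block map; a real controller would hold this
# in a lookup memory attached directly to the custom instruction.
MAP_TABLE = {0x0: 0x40, 0x1: 0x41, 0x7: 0x99}

def lba_map_lookup(cmd_word: int) -> int:
    """Models a fused instruction: extract a 20-bit LBA field from a
    command word and translate it, replacing a shift + mask + load
    sequence (three or more operations) on a conventional CPU."""
    lba = (cmd_word >> 8) & 0xFFFFF   # collapsed into one opcode in hardware
    return MAP_TABLE[lba]
```

In TIE, the extraction and the table access would be described once and verified as a single instruction, rather than as a separate offload block with its own bus interface and state machine.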
SSD controllers powered by Xtensa DPUs are already shipping in large volumes, and more than a dozen new Xtensa-powered SSD controller designs are coming to the market in the next few years.
Virtually Unlimited Bandwidth
When considering a processor for any design, its overall suitability for the task must include getting data into it for processing and then out to the rest of the system to have effect.
Conventional processors connect to the rest of the system via a system bus (32 to 128 bits wide) and perhaps an inbound DMA port. This places an upper bound on the amount of data the processor can operate on; to consume or produce more data, the high-bandwidth operations are either offloaded or more processors are added and the task split across them. It all adds up to more development time, risk and energy consumption.
Xtensa DPUs fundamentally give the designer the ability to add multiple data ports to the processor, each up to 1024 bits wide, along with the registers to hold and process that data internally. Typically we see a few ports up to 256 bits wide in designs that either take inputs directly from one part of the system (RTL/processor) or provide processed results to another part (RTL/processor) - see the diagram below:
The system bus is still there, of course, but there are other ways to get the large amounts of data required in flash controllers into and out of the processor.
Overall, this increases IOPS and reduces energy consumption by requiring fewer bus transactions and by avoiding the need to add more processors or offload engines.
Higher Performance with Lower Energy Consumption
Adding logic gates to address specific bottlenecks in any algorithm reduces cycle counts and naturally leads to lower energy consumption when the cycle count reduction outweighs the extra energy consumed by those additional gates.
To demonstrate, consider a set of common functions performed by Flash controllers - either on a processor or by an offload accelerator - indicated in the charts below:
This Performance chart shows the performance increase compared to Tensilica's high-performance Diamond Standard 570T real-time controller. The 570T is a 3-issue VLIW CPU that can sustain up to 3 RISC operations per cycle, making it competitive with other leading high-end real-time control CPUs.
The gates used to accelerate the performance were added to the 570T as new instructions using the TIE (Verilog-like) language.
There are two implementation options for CRC16: 8 bits per cycle or 16 bits per cycle.
This does not show the number of gates used to get the indicated performance gains - that is built into the Energy chart below...
This Energy chart shows an energy consumption comparison for the same algorithms in the Performance chart above.
EnergyConsumption ∝ GateCount × TotalCycles
The "Reference Energy" column on the far left is a reference point showing 20% of the total height for each of the 5 algorithms shown.
The "Xtensa Energy" column on the right shows the reduction in energy for each algorithm. The net reduction shown across all algorithms would only occur in a design where each algorithm actually accounted for 20% of the total cycles to begin with. As all controller architectures differ, once you know how many processor cycles these functions take in your own design, you can assess how these accelerations may improve overall performance there.
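Assuming dynamic energy scales with the number of switching gates multiplied by the cycles of activity, the trade-off can be checked with simple arithmetic. The base-core gate count used below (100 kGates) is an assumption chosen purely for illustration:

```python
def relative_energy(gate_ratio: float, cycle_ratio: float) -> float:
    # Assumption: dynamic energy ~ (active gates) x (cycles of activity),
    # so relative energy is the product of the two ratios.
    return gate_ratio * cycle_ratio

# Example: CRC16 at 16 bits/cycle gives ~24x fewer cycles for 3 kGates
# added to an assumed 100 kGate base core.
base_gates, added_gates = 100_000, 3_000
speedup = 24
ratio = relative_energy((base_gates + added_gates) / base_gates, 1 / speedup)
# ratio is ~0.043: a ~23x energy reduction for that function, because the
# 3% gate increase is dwarfed by the 24x cycle reduction.
```

Plugging your own gate counts and measured cycle reductions into the same product gives a first-order estimate for any of the accelerated functions.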
Differentiation & Scalability
Our processors are being chosen for use in low-cost designs in the consumer space all the way up to high-end multi-core designs for the demanding enterprise market. Differentiation becomes easier and less costly for each customer as a natural part of product development.
Typically, adding a few hundred gates (as new instructions) can increase the performance of an algorithm by factors of 5x or 10x without noticeably increasing the power consumption. So, when more performance is required and it's not possible (or desirable) to increase the clock speed any further, adding a few critical instructions can reduce the number of cycles required - avoiding the need to add more processors or even create an offload accelerator.
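How a 5x or 10x gain on one algorithm translates into whole-controller performance follows Amdahl's law, a standard result sketched here so you can plug in your own cycle profiles:

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of total cycles
    is accelerated by `local_speedup` and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)
```

For example, a 10x acceleration of an algorithm that consumes 20% of total cycles yields about a 1.22x overall gain; accelerating several such hot spots is how the larger net improvements are reached.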
When it comes to adding much higher levels of performance to support increased data rates and multiple channels, multiple processors and accelerators are typically added. Often the processors are each dedicated to a few types of task with different characteristics, benefiting from instruction-level optimizations for highest efficiency. There are general control tasks as well as data operations that are specific to parts of each customer's design and not efficient on general-purpose processors.
Most processors operate natively on 32-bit quantities. If the algorithm's datatypes are shorter, some additional processing is often required to shift and mask before computation. This may add one or two cycles.
If the datatypes are wider than 32 bits then, typically, a function call is required to handle all the additional operations - this can take tens of cycles.
Xtensa gives designers the flexibility to add registers and instructions that operate on the exact data size required, in a single cycle. Increasing resolution/accuracy by using more bits in the future is easily accommodated by expanding the register width and updating the instruction itself - the instruction opcode doesn't even need to change!
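The overhead described above shows up in ordinary code. A minimal Python model: extracting a sub-32-bit field costs a shift plus a mask, and a 128-bit add on a 32-bit CPU decomposes into four word-adds with carry propagation - exactly the sequences a width-matched custom instruction collapses into one operation:

```python
def extract_field(word: int, lsb: int, width: int) -> int:
    """Sub-32-bit datatype on a 32-bit CPU: shift then mask (extra cycles)."""
    return (word >> lsb) & ((1 << width) - 1)

def add_128(a_words, b_words):
    """128-bit add on a 32-bit CPU: four 32-bit adds with carry propagation,
    typically hidden inside a multi-instruction helper function."""
    out, carry = [], 0
    for a, b in zip(a_words, b_words):   # least-significant word first
        s = a + b + carry
        out.append(s & 0xFFFFFFFF)
        carry = s >> 32
    return out, carry
```

A 128-bit register with a matching add instruction turns the whole `add_128` loop into a single cycle, and widening later to 256 bits changes only the register and datapath width, not the opcode.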
Increasing data throughput caused by newer interface standards, multiple channels or competitive pressure can be met by:
- Expanding the width of the existing ports
- Adding more I/O ports to spread the load across multiple processing units
Xtensa gives you both options, expanding the width of the ports up to 1024 bits each and adding more of them.
OEMs often provide products in multiple markets that have different requirements and priorities. Consumer markets tend to require lower-cost products, whereas enterprise markets look for higher IOPS and reliability.
Much of the firmware that is developed can be re-used in all markets, so a common processor architecture is desirable from both a Hardware and Software development perspective.
Using processor architectures with fixed performance/power/area points typically compromises each design with excess capability - the next processor in the range must be chosen, even if it is 50% more capable than it needs to be. This leads to extra cost and energy consumption compared to a processor that is just sufficient for the design.
Designers using Xtensa can customise their processor(s) to do just what is needed. Any differentiating customizations are only known to the OEM and can be re-used in other Xtensa processors. This makes Xtensa-based processors ideal for use across all solid state controller designs, where OEMs must differentiate.
Learn more details about using Tensilica in your storage products
Please look at the PDFs and workspaces below:
Other optimizations may also be available, please contact your local sales representative for more information.