Storage

See the Application Note on accelerating Lempel-Ziv compression by over 5x!

Flash memory has exploded from simple portable "thumb" drives to low-power, high-performance enterprise servers, and the market is still changing rapidly.

The unique capabilities of Tensilica processors have proven ideal for this market - one with few standards but plenty of innovation aimed at solving the write amplification, energy, and error-correction problems as well as increasing IOPS.

Our licensees are shipping Xtensa-based products that lead the industry. With processing logic and data I/O tailored to their particular product or product line, they can increase IOPS with fewer gates and lower energy consumption. No other processor can offer this.

See the customers that use Tensilica cores in Storage

Where can Tensilica processors be used in your flash controller?


Cryptography

  • AES-XTS up to 265x faster for 35kGates
  • Triple-DES up to 50x faster for 5kGates
  • SHA-1 up to 12x faster for 33kGates
  • ... compared to general purpose processors.

Cyclic Redundancy Check (CRC)

  • Up to 12x faster at 8 bits per cycle for 3kGates
  • Up to 24x faster at 16 bits per cycle for 3kGates
  • ... compared to general purpose processors.

Lempel Ziv Compression

  • ~5.5x faster for <17kGates
  • ... compared to general purpose processors.
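The hot spot in Lempel-Ziv compression is the search for the longest prior match, and that byte-compare loop is exactly the kind of code custom instructions target. As a rough illustration only (the accelerated implementation itself is not described here), a generic C sketch of that inner loop, with an invented 4 KB window:

```c
#include <stddef.h>

/* Minimal LZ77-style longest-match search: an illustrative sketch of the
 * inner loop that dominates Lempel-Ziv compression time. The window size
 * and function name are invented, not Tensilica's implementation. */
#define LZ_WINDOW 4096

/* Find the longest match for src[pos..] within the preceding window.
 * Returns the match length; *offset receives the backward distance. */
static size_t lz_longest_match(const unsigned char *src, size_t pos,
                               size_t len, size_t *offset)
{
    size_t best_len = 0, best_off = 0;
    size_t start = pos > LZ_WINDOW ? pos - LZ_WINDOW : 0;

    for (size_t cand = start; cand < pos; cand++) {
        size_t l = 0;
        while (pos + l < len && src[cand + l] == src[pos + l])
            l++;                       /* one byte compared per iteration */
        if (l > best_len) {
            best_len = l;
            best_off = pos - cand;
        }
    }
    *offset = best_off;
    return best_len;
}
```

A custom instruction can compare many window bytes per cycle instead of one, which is where speedups of this magnitude come from.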

LDPC Error Correction

  • Software programmable for flexibility
  • Shorter development and easier maintenance
  • Similar in size to RTL

Custom/Proprietary acceleration

  • Customer algorithms can be accelerated
  • No one else will have the same acceleration

Linked List Search

  • 3x faster for 1 key match in <200 gates
  • 4x-6x faster for 3 key matches
  • ... compared to general purpose processors.
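For context, the loop being accelerated looks roughly like the single-key walk below; the node layout and key type are invented for illustration, not Tensilica's implementation. On a general-purpose CPU each node costs at least a load, a compare, and a branch, and a fused custom instruction can collapse those per-node cycles:

```c
#include <stddef.h>

/* Hypothetical linked-list node; field names are illustrative only. */
struct node {
    unsigned key;
    struct node *next;
};

/* Return the first node whose key matches, or NULL if none does. */
static struct node *list_find(struct node *head, unsigned key)
{
    for (struct node *n = head; n != NULL; n = n->next)
        if (n->key == key)      /* load + compare + branch per node */
            return n;
    return NULL;
}
```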

Host Protocol processing

  • Multiple protocol support
  • Single cycle per header
  • Any width up to 1024 bits
  • Initiate Data DMA
  • More processing is available if needed

Table Lookup

  • 7x faster for <1kGates
  • ... compared to general purpose processors.

Virtually Unlimited Bandwidth

When considering a processor for any design, overall suitability has to include getting data into it for processing and then back out to the rest of the system to have effect.

Conventional processors connect to the rest of the system via a system bus (32 to 128 bits wide) and perhaps an inbound DMA port. This places an upper bound on the amount of data the processor can operate on - to consume or produce more data, the high-bandwidth operations are either offloaded or more processors are added and the task split across them. It all adds up to more development time, risk and energy consumption.

Xtensa DPUs fundamentally give the designer the ability to add multiple data ports to the processor, each up to 1024 bits wide - as well as the registers to hold and process that data internally. Typically we'll see a few ports up to 256 bits wide in designs that either take inputs directly from one part of the system (RTL/processor) or provide processed results to another part (RTL/processor) - see the diagram below:

Flexible, wide I/O

The system bus is still there, of course, but there are other ways to get the large amounts of data required in flash controllers into and out of the processor.

Overall, this increases IOPS and reduces energy consumption: fewer bus transactions occur, and there is no need to add more processors or offload engines.

Higher Performance with Lower Energy Consumption

Adding logic gates to address specific bottlenecks in any algorithm reduces cycle counts and naturally leads to lower energy consumption when the cycle count reduction outweighs the extra energy consumed by those additional gates.

To demonstrate, consider a set of common functions performed by Flash controllers - either on a processor or by an offload accelerator - indicated in the charts below:

This Performance chart shows the performance increase compared to Tensilica's high-performance Diamond Standard 570T real-time controller. The 570T is a 3-issue VLIW CPU that can sustain up to 3 RISC operations per cycle, making it competitive with other leading high-end real-time control CPUs.

The gates used to accelerate the performance were added to the 570T as new instructions using the TIE (Verilog-like) language.

There are two implementation options for CRC16:

  1. "Hash" uses a lookup table. This takes fewer cycles but requires additional memory to store the table; it's typically the choice when the CRC runs on a processor.
  2. "NoHash" is logic-only and is typically used in offload-accelerator implementations.
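Both styles can be sketched in C. The polynomial below (CRC-16/CCITT, 0x1021) is chosen purely for illustration - the charts don't state which CRC16 variant was measured:

```c
#include <stdint.h>
#include <stddef.h>

static uint16_t crc16_table[256];

/* Build the 256-entry table used by the "Hash" variant. */
static void crc16_init(void)
{
    for (int i = 0; i < 256; i++) {
        uint16_t c = (uint16_t)(i << 8);
        for (int b = 0; b < 8; b++)
            c = (c & 0x8000) ? (uint16_t)((c << 1) ^ 0x1021)
                             : (uint16_t)(c << 1);
        crc16_table[i] = c;
    }
}

/* "Hash" variant: one table lookup per byte, 512 bytes of memory. */
static uint16_t crc16_hash(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFF;
    while (n--)
        crc = (uint16_t)((crc << 8) ^ crc16_table[(crc >> 8) ^ *p++]);
    return crc;
}

/* "NoHash" variant: bit-serial logic only, no table - the structure an
 * offload accelerator typically hardwires. */
static uint16_t crc16_nohash(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFF;
    while (n--) {
        crc ^= (uint16_t)(*p++ << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```

Both forms compute the same result; they trade memory for cycles, which is why the processor-side and accelerator-side implementations differ.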


This does not show the number of gates used to get the indicated performance gains - that is built into the Energy chart below...

This Energy chart shows an energy consumption comparison for the same algorithms in the Performance chart above.


EnergyConsumption ∝ GateCount × TotalCycles


The "Reference Energy" column on the far left is a reference point showing 20% of the total height for each of the 5 algorithms being shown.

The "Xtensa Energy" column on the right shows the reduction in energy for each algorithm. The net reduction shown across all algorithms would only occur in a design where each algorithm actually consumed 20% of the total cycles to begin with. Since every controller architecture is different, once you know how many processor cycles these functions take in your own design you can assess how these accelerations would improve overall performance there.
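That caveat can be made concrete with a back-of-envelope model. Assuming dynamic energy is roughly proportional to gate count times cycles spent, the overall relative energy is the cycle-share-weighted sum over the accelerated algorithms. All numbers in the example below are invented for illustration, not taken from the charts:

```c
/* Rough model: relative energy after acceleration, normalized so the
 * unmodified baseline processor = 1.0. Assumes energy ~ gates x cycles.
 * share[i]      - fraction of baseline cycles spent in algorithm i
 * speedup[i]    - cycle-count reduction factor for algorithm i
 * gate_ratio[i] - relative gate count while running algorithm i (>1.0) */
static double overall_energy(const double share[], const double speedup[],
                             const double gate_ratio[], int n)
{
    double e = 0.0;
    for (int i = 0; i < n; i++)
        e += share[i] * gate_ratio[i] / speedup[i];
    return e;
}
```

For example, five algorithms at a 20% cycle share each, sped up 12x, 24x, 5.5x, 3x and 7x with ~5% extra gates apiece, would land around one sixth of the baseline energy - but only because of the assumed equal shares.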

Differentiation & Scalability

Our processors are being chosen for use in low cost designs in the consumer space all the way up to high end multi-core designs for the demanding Enterprise market. Differentiation is made easier and lower cost for each customer as a natural part of product development.

Performance

Typically, adding a few hundred gates (as new instructions) can increase the performance of an algorithm by factors of 5x or 10x without noticeably increasing the power consumption. So, when more performance is required and it's not possible (or desirable) to increase the clock speed any further, adding a few critical instructions can reduce the number of cycles required - avoiding the need to add more processors or even create an offload accelerator.

When it comes to adding much higher levels of performance to support increased data rates and multiple channels, multiple processors and accelerators are typically added. Often the processors are each dedicated to a few types of task with different characteristics, benefiting from instruction-level optimizations for the highest efficiency. There are general control tasks as well as data operations that are specific to parts of each customer's design and not efficient on general-purpose processors.

Algorithmic

Most processors operate natively on 32-bit quantities. If the algorithm's datatypes are shorter, some additional shift-and-mask processing is often required before computation. This can add one or two cycles per operation.

If the datatypes are wider than 32 bits then, typically, a function call is required to handle all the additional operations - this can take tens of cycles.

Xtensa gives designers the flexibility to add registers and instructions that operate on the exact data size required, in a single cycle. Increasing resolution or accuracy by using more bits in the future is easily accommodated by expanding the register width and updating the instruction itself - the instruction opcode doesn't even need to change!
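The two overheads described above can be seen in plain C. The 12-bit field extraction and the 48-bit add below are invented examples of sub-word and super-word datatypes:

```c
#include <stdint.h>

/* Sub-word case: extracting a 12-bit field packed at an arbitrary bit
 * offset costs a shift plus a mask on a 32-bit CPU - the extra cycles a
 * width-exact custom instruction removes. The layout is illustrative. */
static uint32_t extract12(uint32_t word, unsigned bit_off)
{
    return (word >> bit_off) & 0xFFFu;  /* shift, then mask to 12 bits */
}

/* Super-word case: a 48-bit add in software needs two word-sized adds
 * plus explicit carry handling (stored as two 32-bit words, low first). */
static void add48(const uint32_t a[2], const uint32_t b[2], uint32_t r[2])
{
    uint32_t lo = a[0] + b[0];
    uint32_t carry = lo < a[0];             /* detect low-word wraparound */
    r[0] = lo;
    r[1] = (a[1] + b[1] + carry) & 0xFFFFu; /* keep the top 16 of 48 bits */
}
```

A 12-bit or 48-bit register file with matching single-cycle instructions eliminates both sequences entirely.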

I/O Throughput

Increased data-throughput demands - driven by newer interface standards, multiple channels or competitive pressure - can be met by:

  • Expanding the width of the existing ports
  • Adding more I/O ports to spread the load across multiple processing units

Xtensa gives you both options, expanding the width of the ports up to 1024 bits each and adding more of them.

Product Line

OEMs often provide products in multiple markets that have different requirements and priorities. Consumer markets tend to require lower-cost products, whereas enterprise markets look for higher IOPS and reliability.

Much of the firmware that is developed can be re-used in all markets, so a common processor architecture is desirable from both a Hardware and Software development perspective.

Using processor architectures with fixed performance/power/area points typically compromises each design with excess capability - the next processor up in the range has to be chosen, even if it's 50% more capable than it needs to be. This leads to extra cost and energy consumption compared to a processor that is just sufficient for the design.

Designers using Xtensa can customise their processor(s) to do just what is needed. Any differentiating customizations are only known to the OEM and can be re-used in other Xtensa processors. This makes Xtensa-based processors ideal for use across all solid state controller designs, where OEMs must differentiate.

Learn more details about using Tensilica in your storage products

Please look at the PDFs and workspaces below:



Other optimizations may also be available, please contact your local sales representative for more information.



Did You Know?


Tensilica is the largest privately held semiconductor IP licensor.

©2017 Tensilica Inc. All rights reserved.