Design and Implementation of Signal Processing Systems: An Introduction

Outline • What is signal processing?• Implementation Options and Design issues: – General purpose (micro) processor (GPP) • Multimedia enhanced extension (Native signal processing) – Programmable digital signal processors (PDSP) • Multimedia signal processors (MSP) – Application specific integrated circuit (ASIC)– Re-configurable signal processors 2

Issues in DSP Architectures and Projects • Provide students with a global view of embedded micro-architecture implementation options and design methodologies for multimedia signal processing. • The interaction between the algorithm formulation and the underlying architecture that implements the algorithm will be focused: – Formulate algorithm that matches the architecture.– Design novel architecture to match algorithm.

Issues to be treated in projects • Signal processing computing • Native signal processing and algorithms multimedia extension • Algorithm representations • Programmable DSPs • Algorithm transformations: • Very Long Instruction Word – Retiming, unfolding (VLIW) Architecture – Folding • Re-configurable computing & • Systolic array and design FPGA methodologies • Signal Processing arithmetics: • Mapping algorithms to array CORDIC, and distributed structures arithmetic. • Low power design • Applications: Video, audio, communication

What is Signal? • A SIGNAL is a measurement of a physical quantity of certain medium. • Examples of signals: – Visual patterns (written documents, picture, video, gesture, facial expression) – Audio patterns (voice, speech, music)– Change patterns of other physical quantities: temperature, EM wave, etc. • Signal contains INFORMATION!

Medium and Modality • Medium: – Physical materials that carry the signal.– Examples: paper (visual patterns, handwriting, etc.), Air (sound pressure, music, voice), various video displays (CRT, LCD) • Modality: – Different modes of signals over the same or different media. – Examples: voice, facial expression and gesture.

What is Signal Processing? • Ways to manipulate • Types of processing: signal: – Transformation– Filtering – in its original medium – Detection – or an abstract – Estimation representation. – Recognition and • classification Signal can be – Coding (compression) abstracted: – Synthesis and reproduction – as functions of time – Recording, archiving – or spatial coordinates. – Analyzing, modeling

Signal Processing Applications • Communications: – • Modulation/Demo Audio dulation (modem) – 3D sound, – – surround sound Channel estimation, • Speech equalization – Coding – – Recognition Channel coding – Synthesis – Source coding: – Translation compression • Virtual reality, animation, • Imaging: • Control – Digital camera, – Hard drive, – scanner – Motor – HDTV, DVD • Robotics and Intelligent Systems

Digital Signal Processing • Signals generated via • Digital signal processing physical phenomenon are concerns processing analog in that signals using digital – Their amplitudes are defined computers. over the range of – A continuous time/space real/complex numbers signal must be sampled to – Their domains are continuous yield countable signal in time or space. samples. • Processing analog signal – The real-(complex) valued requires dedicated, special samples must be hardware. quantized to fit into internal word length.

Signal Processing Systems D D iig g iitta a ll S S iig g n n a a ll D/A P P rro o c c e e ssssiin n g g A/D • The task of digital signal processing (DSP) is: – to process sampled signals (from A/D analog to digital converter), – and provide its output to the D/A (digital to analog converter) to be transformed back to physical signals.

Implementation of DSP Systems • Platforms: • Requirements: – Native signal processing – Real time (NSP) with general purpose • Processing must be done processors (GPP) before a pre-specified • Multimedia extension (MMX) deadline. instructions – – Streamed numerical data Programmable digital signal processors (PDSP) • Sequential processing • Media processors • Fast arithmetic processing – Application-Specific – High throughput Integrated Circuits (ASIC) • Fast data input/output – Re-configurable computing • Fast manipulation of data with field-programmable gate array (FPGA)

How Fast is Enough for DSP? • It depends! • Different throughput rates • Real time requirements: for processing different – Example: data capture speed signals must match sampling rate. – Throughput ∝ sampling rate. Otherwise, data will be lost. – CD music: 44.1 kHz – Example: in verbal – Speech: 8-22 kHz conversation, delay of – Video (depends on frame response can not exceed rate, frame size, etc.) range 50ms end-to-end. from 100s kHz to MHz. – Processing must be done by a specific deadline. – A constraint on throughput.

Early Signal Processing Systems • Implemented with • Key approach: either main frame – Faster hardware computer or special – Faster algorithms purpose computers. • Faster algorithms • Batch processing – Reduce the number of rather than real time, arithmetic operations streamed data – Reduce the number of bits to processing. represent each data • Accelerate processing – Most important example: Fast Fourier Transform speed is of main Fast Fourier Transfor concern.

Computing Fourier Transform Discrete Fourier Transform • Fast Fourier Transform N 1 n π k X (k) = ∑− x(n) − 2 exp[ ] – Reduce the computation to N n=0 O(N log N) complex 2 N −1 n π k x(n) = 1 2 X (k) exp[ ] multiplications ∑ N N k =0 – Makes it practical to process large amount of digital data. – • Many computations can be To compute the N frequencies “Speed-up” using FFT {X(k); 0 ≤ k ≤ N−1} requires – Dawn of modern digital N2 complex multiplications signal processing

Evolution of Micro-Processor • Clock frequency increases • Micro-processors from 100KHz to 1GHz implemented a central • Number of transistors processing unit on a increases from 1K to 50M single chip. • Power consumption • Performance improved increases much slower with from 1MFLOP (1983) to the use of lower supply 1GFLOP or above voltage: 5 V drops to 1.5V • Word length (# bits for register, data bus, addr. Space, etc) increases from 4 bits to 64 bits today.

Native Signal Processing • MMX (multimedia extension General purpose instructions): special instructions for accelerating multimedia tasks. • Use GPP to perform signal • May share the same data-path processing task with no additional with other instructions, hardware. – or work on special hardware – Example: soft-modem, soft DVD modules. player, soft MPEG player. • Make use sub-word parallelism • Reduce hardware cost! to improve numerical calculation • May not be feasible for extremely speed. high throughput tasks. • Implement DSP-specific • It is interfering with other tasks arithmetic operations, eg. because GPP is tied up with NSP Saturation arithmetic tasks. operations.

ASIC: Application Specific ICs • Custom or semi-custom IC • ASIC becomes popular due chip or chip sets developed to availability of IC foundry for specific functions. services. • Suitable for high volume, • Fab-less design houses turn innovative design into low cost productions. profitable chip sets using • Examples: MPEG codec, CAD tools. 3D graphic chip, etc. • Design automation is a key enabling technology to facilitate fast design cycle and shorter time to market delay.

Programmable Digital Signal Processors (PDSPs) • Micro-processors designed • PDSPs were developed for signal processing to fill a market segment applications. between GPP and ASIC: • Special hardware support for: – GPP flexible, but slow – Multiply-and-Accumulate – ASIC fast, but inflexible (MAC) ops – Saturation arithmetic ops • As VLSI technology – Zero-overhead loop ops improves, role of PDSP – Dedicated data I/O ports changed over time. – Complex address calculation and memory access – Cost: design, sales, – Real time clock and other maintenance/upgrade embedded processing supports. – Performance

Multimedia Signal Processors • Specialized PDSPs • Main applications: designed for multimedia – Video signal processing, applications MPEG, H.324, H.263, • Features: etc. – Multi-processing system with – 3D surround sound a GPP core plus multiple – Graphic engine for 3D function modules rendering – VLIW-like instructions to promote instruction level parallelism (ILP) – Dedicated I/O and memory management units.

Re-configurable Computing using FPGA • FPGA (Field programmable gate array) is a derivative of PLD • Use of FPGA (programmable logic devices). • – They are hardware configurable Rapid prototyping: run to behave differently for fractional ASIC speed different configurations. without fabrication delay. • – Slower than ASIC, but faster Hardware accelerator: using than PDSP. the same hardware to realize different function modules to • Once configured, it behaves save hardware like an ASIC module. – Low quantity system deployment

Characteristics and Impact of VLSI • The term VLSI (Very Large • Characteristics Scale Integration) is coined in – High density: late 1970s. • Reduced feature size: 0.25µm • Usage of VLSI: -> 0.16 µm • – % of wire/routing area Micro-processor increases • General purpose – Low power/high speed: • Programmable DSP • Decreased operating voltage: • Embedded µ-controller 1.8V -> 1V – Application-specific ICs • Increased clock frequency: – Field-Programmable Gate 500 MHz-> 1GH. Array (FPGA) – High complexity: • Impacts: • Increased transistor count: 10M transistors and higher – Design methodology • Shortened time-to-market – Performance delay: 6-12 months – Power

Design Issues • Software design: • Given a DSP application, – NSP/MMX, PDSP/MSP which implementation – Algorithms are implemented as option should be chosen? programs. • For a particular – Often still require programming implementation option, how in assembly level manually to achieve optimal design? • Hardware design: – ASIC, FPGA • Optimal in terms of what – Algorithms are directly criteria? implemented in hardware modules. • S/H Co-design: System level design methodology.

Design Process Model • Design is the process that • Implementation links algorithm to – Assignment: Each operation implementation can be realized with • Algorithm • One or more instructions (software) – Operations • One or more function modules – Dependency between (hardware) operations determines a – Scheduling: Dependence partial ordering of execution relations and resource – Can be specified as a constraints leads to a dependence graph schedule.

A Design Example … Consider the algorithm: • Operations: – Multiplication n – Addition y = ∑a(k)x(k) • Dependency k =1 – Program: y(k) depends on y(k-1) – Dependence Graph: y(0) = 0For k = 1 to n Do a(1) x(1) a(2) x(2) a(n) x(n) y(k) = y(k-1)+ a(k)*x(k)End * * * y = y(n) y(0) + + + y(n)

Design Example cont’d … • Software • Hardware Implementation: Implementation: – Map each * op. to a multiplier, – Map each * operation to a and each + op. to an adder. MUL instruction. – Interconnect them according to – Map each + operation to a the dependence graph: ADD instruction. – Allocate memory space for {a(k)}, {x(k)}, and {y(k)} – Schedule the operation by a(1) x(1) a(2) x(2) a(n) x(n) sequentially execute y(1)=a(1)*x(1), y(2)=y(1) + * * * a(2)*x(2), etc. y(0) – y(n) Note that each instruction is + + + still to be implemented in hardware.

Observations • Bottom line – • Eventually, an Hardware/ software co- implementation is design. There is a realized with hardware. continuation between • However, by using the hardware and software same hardware to realize implementation. different operations at different time • A design must explore (scheduling), we have a both simultaneously to software program! achieve best performance/cost trade- off.

Designer has two approaches! • 1. Matching hardware to • 2. Formulate algorithm to algorithm match hardware – Hardware architecture must – Algorithm must be formulated so match the characteristics of that they can best exploit the potential of architecture. the algorithm. – Example: – Example: • GPP, PDSP architectures are • ASIC architecture is designed fixed. to implement a specific • One must formulate the algorithm algorithm, properly to achieve best performance. • and hence can achieve • superior performance. Eg. To minimize number of operations.

Algorithm Reformulation • Matching algorithm to architectural features – Similar to optimizing assembly code– Exploiting equivalence between different operations • Reformulation methods – Equivalent ordering of execution: • (a+b)+c = a+(b+c) – Equivalent operation with a particular representation: • a*2 is the same as left-shift a by 1 bit in binary representation – Algorithmic level equivalence • Different filter structures implementing the same specification!

Algorithm Reformulation (2) • Exploiting parallelism – Regular iterative algorithms and loop reformulation • Well studied in parallel compiler technology – Signal flow/Data flow representation • Suitable for specification of pipelined parallelism

Mapping Algorithm to Architecture • Scheduling and Assignment Problem – Resources: hardware modules, and time slots– Demands: operations (algorithm), and throughput • Constrained optimization problem – Minimize resources (objective function) to meet demands (constraints) • For regular iterative algorithms and regular processor arrays –> algebraic mapping. 15

Mapping Algorithms to Architectures • Irregular multi-processor architecture: – linear programming– Heuristic methods– Algorithm reformulation for recursions. • Instruction level parallelism – MMX instruction programming– Related to optimizing compilation.

Arithmetic • CORDIC – Compute elementary functions • Distributed arithmetic – ROM based implementation • Redundant representation – eliminate carry propagation • Residue number system 14

Low Power Design is important in DSP • Device level low power design• Logic level low power design• Architectural level low power design • Algorithmic level low power design

What is an LFSR & MISR circuit? • LFSR & MISR (Linear Feedback Shift Register & Multiple Input Signature Register) circuits are two types of a specially connected series of flip flops with some form of XOR/XNOR feedback. • They are used in many applications for the generation or detection of Pseudo Random Sequences.

LFSR Block Diagram Generic LFSR Feedback In Out D1 Q1 D2 Q2 D3 Q3 D4 Q4 Clk

LFSR Block Diagram (cont.) By Changing the Feedback path to “tap” only certain FF’s, a Maximal Length Sequence can be produced. Maximal Length LFSR (n = 4) Feedback In Out D1 Q1 D2 Q2 D3 Q3 D4 Q4 Clk Polynomial: 1 + x3 + x4 Maximal Length: (2n – 1) = (24 – 1) = (16 – 1) = 15

Problems with this type of LFSR Generic LFSR Feedback Out In D1 Q1 D2 Q2 D3 Q3 D4 Q4 Clk • Setup Time – Feedback for D1 has to go through N XORs before arriving. N Logic delays slows down circuit performance (may need to run “at speed”). • Solution is to have many-input XOR feeding D1 input (1 logic level). • State 000 is illegal. When FFs power up, they must be initialized with valid data. Solution is to use XNORs instead. Still produces a PRBS but all zeros is a valid state.

Maximal Length Sequence State FF 1 FF 2 FF 3 FF 4 Feedback S0 0 0 0 1 S1 1 0 0 0 S2 0 1 0 0 S3 0 0 1 0 In Out S4 1 0 0 1 D1 Q1 D2 Q2 D3 Q3 D4 Q4 S5 1 1 0 0 S6 0 1 1 0 S7 1 0 1 1 S8 0 1 0 1 Clk S9 1 0 1 0 S10 1 1 0 1 S11 1 1 1 0 S12 1 1 1 1 S13 0 1 1 1 Output Sequence: S14 0 0 1 1 S15=S0 0 0 0 1 100010011010111,10001… S16=S1 1 0 0 0

MISR Block Diagram Generic MISR D1 D2 D3 D4 Out D1 Q1 D2 Q2 D3 Q3 D4 Q4 Feedback Multiple Inputs (4-bit wide): {D1,D2,D3,D4}

LFSR & MISR Applications: • BIST (Built-in Self Test) of logic devices.• Cyclic Encoding/Decoding (Cyclic Redundancy Check) • Pseudo Noise Generator• Pseudo Random Binary Sequence Generator• Spread Spectrum (CDMA) applications

Built-In Self Test (BIST) • Devices can be self-tested (at speed) by incorporating LFSR and MISR circuits into the design. Testing can occur while the device is operating or while in an idle mode. • An LFSR generates a Pseudo-Random Test Pattern. A small LFSR with the appropriate feedback can generate very long sequences of apparently random data.

Built-In Self Test (BIST) (cont.) • The Pseudo-Random pattern that is generated by the LFSR is feed through the logic under test then into the MISR. – The MISR will essentially compare the result with a known “good” signature. – If the result is the same, then there were no errors in the logic. • Refer to Dr. Perkowski’s Built-In Self Test Presentation in Test Class for more information.

Spread Spectrum PRBS • Because PN signals have good auto-correlation, they are used in Code Division Multiple Access Spread Spectrum Communication Systems. • Pseudo Random Noise Sequences are used to effectively “spread” the overall bandwidth of a CDMA signal. • For every data bit that is to be transmitted, a PRNS is substituted. The Information rate remains the same, but the new bit rate is dramatically increased. 1 -> 100010011010111…0 -> 011101100101000…

Spread Spectrum PRBS (cont.) • Below is a diagram showing an efficient arbitrary PRBS generator.• By modifying Tap_config[0:3] and selecting the proper output, this circuit can generate many different Pseudo Random Binary Sequences. Tap_config[0:3] D1 Q1 D2 Q2 D3 Q3 D4 Q4 Clk Out_sel[0:1] 0 1 2 3 Out

Practical LFSR and MISR circuits • LFSR and MISR circuits are used in many applications. • As technology continues to advance, more and more devices will be developed that will utilize the unique properties of these powerful circuits. • Built-In Self Test and Spread Spectrum (CDMA) applications are but a few of the many places where LFSR and MISR circuits are used.

Practical Combinational Multipliers

What is a combinational multiplier? • A combinational multiplier circuit is comprised of multiple shift registers, an adder, and some control logic. • A multiply is performed by addition and shifting. • Typical generic multipliers are slow, often taking multiple clock cycles to computer a product. • Computers without dedicated multipliers must perform a multiply using this method.

Example: 4-bit Multiply 2 ‘ s C o m p l e m e n t a 0 b 1 a 0 b 3 a 0 b 0 a 0 b 2 1 1 0 1 H Aa 1 Hb 2 Aa 1 b H1 Aa 1 b 0 x 0 1 1 1 0 1H A F A F A a 2 b 2 a 2 b 1 a 2 b 0 1 1 0 1H A a 3 bF 2 Aa 3 b F1 A 1 1 0 1 a 3 b 0 0 0 H A F A F A – - -a 3- b -3 -a 2 -b 3 a 1 b 3 0 1 0 1 1 0 1 1 c 7 c 6 c 5 c 4 c 3 c 2 c 1 c 0 P r o d u c t T e r m s FA= Full Add HA=Half Add

Generic Serial Multiplier Block Diagram Digital Systems Principals and Applications, Ronald J. Tocci, Prentice Hall 1995, pg 280

So what’s wrong with this type of multiplier? • For an N x N generic Multiplier, it takes N clock cycles to get a product. That’s too slow! • Inefficient use of hardware.

Types of Multipliers • Standard Binary Multiplier (ones complement, twos complement, universal, etc…) • Re-coded Multipliers (Canonical Signed Digit, Booth, etc…)• Serial / Parallel Multipliers• Iterative Cellular Array Multipliers (Wallace, Pezaris, Baugh-Wooley, etc…) • ROM based Multiplication Networks (Constant Coefficient Multipliers, Logarithmic, etc…)

Multiplier Applications • General Purpose Computing • Digital Signal Processing – Finite Impulse Response Filters– Convolution

ROM Based Constant Coefficient Multiplier • With some DSP applications, such as FIR filter generation and convolution, where the coefficients remain unchanged and high speed is a requirement, using a look-up table approach to multiplication is quite common. • Using the known coefficients, every possible product is calculated and programmed into a look-up table. (ROM or RAM) • The unknown multiplicand (input data) is used as an address to “look up” the product. • This method results in very high speed multiplies, however it requires large amounts of storage space.

ROM Based Constant Coefficient Multiplier (cont.) • Uses ROM to generate partial product• Sum all partial product ROM outputs C o n s t a n t C o e f f i c i e n t M u l t i p l i e r ( K C M ) R O M L o o k – U p T a b l e 0 1 k2 k3 k 1 2 4 . 1 6 . 0 0 0 0 1 5 k A x [ 7 : 0 ] D Y [ 1 5 : 0 ] 1 6 8 R O M D L o o k – U p T a b l 0e 0 0 0 0 1 k 1 6 2 k 1 2 4 3 k .. 1 5 k

Practical Combinatorial Multipliers • Generic Shift/Add type multipliers are SLOW! • People will always be searching for methods of performing faster multiplies. • Multipliers are used in many areas. • General purpose math for PCs and DSP (FIR filters, Convolution, etc…) applications are just a few of the places were multipliers are utilized.

References • Digital Systems Principals and Applications, Ronald J. Tocci, Prentice Hall 1995, pg 278-282 • Xilinx Application Note (XAPP 054). Constant Coefficient Multipliers for XC4000E. http://www.xilinx.com/xapp/xapp054.pdf • Altera Application Note (AN 82). Highly Optimized 2-D convolvers in FLEX Devices. http://www.altera.com/document/an/an082_01.pdf • Computer Arithmetic Principles, Architecture, and Design, Kai Hwang, John Wiley & Sons, Inc. 1979, pg129-212

References • Dr. Perkowski. Design for Testability Techniques (Built-In Self-Test) presentation. http://www.ee.pdx.edu/~mperkows/CLASS_TEST_99/BIST.PDF • Digital Communications Fundamentals and Applications, Bernard Sklar, Prentice Hall 1988, Pg 290-296, Pg 546-555 • Xilinx Application Note (XAPP 052). Efficient Shift Registers, LFSR Counters, and Long Pseudo-Random Sequence Generators. http://www.xilinx.com/xapp/xapp052.pdf • Sun Microsystems’ sponsored EDAcafe.com website. Chapter 14 – Test. http://www.dacafe.com/Book/CH14/CH14.htm

Sources •Yu Hen Hu •Andrew Iverson, ECE 572