: (general-purpose, x86) , , , . , : , Digital Signal Processors DSP.

DSP , ( ) , . ARM-, DSP.

DSP, , , .

DSP 1970- . - , , ( - ). , ( MOSFET) (, ) , .. , , . , (time-to-market), , . .

Fig.  1 DSP's first major success: Speak & Spell tablet (Texas Instruments, 1978)
Fig.  2 Since the advent of the GSM standard, DSPs have been an indispensable component of mobile networks.
Fig. 3 Image processing in cameras (debayering, noise removal, filtering) is also done on DSPs (source: https://snapshot.canon-asia.com/india/article/en/5-things-made-possible-with-digic-image-processor)

- DSP . - , .. , .


DSP , Intel Xeon Cortex-A, ? Intel.

Fig.  4 Intel Skylake (source: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client))

, , , (out-of-order speculative execution) (scheduling). , “” , .. , , 1%:

While a simple arithmetic operation requires around 0.5–20 pJ, modern cores spend about 2000 pJ to schedule it.

Conventional multicore processors consume 157–707 times more energy than customized hardware designs.

(Source: “Rise and Fall of Dark Silicon”.)
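The two figures quoted above can be combined into a rough estimate of how little of the energy budget reaches the arithmetic itself. The sketch below is only an illustration: the function name is made up, and charging exactly one scheduling cost per operation is a simplifying assumption, not a claim from the cited article.

```c
#include <assert.h>

/* Fraction of the total energy spent on the arithmetic itself when every
 * operation also pays a fixed scheduling cost (both values in picojoules).
 * Hypothetical helper; the one-cost-per-operation model is an assumption. */
double useful_energy_fraction(double op_pj, double schedule_pj)
{
    assert(op_pj > 0.0 && schedule_pj >= 0.0);
    return op_pj / (op_pj + schedule_pj);
}
```

With the quoted numbers, a 20 pJ operation scheduled at 2000 pJ keeps just under 1% of the energy for useful work (20 / 2020), and a 0.5 pJ operation keeps about 0.025%.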

, Intel DSP Texas Instruments ( Skylake Xeon Platinum 8180M TMS320C6713BZDP300):


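Metric                          CPU (Intel)             DSP (TI)
Performance                     560 GIPS                1.8 GIPS
Price                           $13K (+ )               $35 ( )
Target applications
Performance / watt / core       0.097 GIPS/W/core       1.7 GIPS/W/core (17 times better than Intel)
Performance / watt / core / $   0.0075 MIPS/W/core/$    0.051 GIPS/W/core/$ (7000 times better than Intel)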


, DSP , 1 4 (!) . : DSP : , , .. ( , ) .


DSP. Jennifer Eyre, BDTI, “ DSP , ” (“Architecture of DSP is molded by algorithms”, “Evolution of DSP Processors”). :

  • Instruction-level parallelism (ILP)

  • (, , )

  • ,

, .


  • SIMD (Single Instruction, Multiple Data) instructions

  • Complex instructions (CISC, Complex Instruction Set Computer):

    1. ( , , , )

    2. ( )

    3. ( , - -, .)

    4. - ( , , - )

    5. ( IP- Ceva Tensillica)

  • ( scatter/gather)

  • , :

    1. ( )

    2. ( )

    3. Hardware loops (zero-overhead loops)

  • - ( , , QR-, )

  • Direct memory access (DMA), including 2D/3D transfers

  • ( 20- 40- )

  • In-order execution, without speculation or out-of-order execution

  • Very long instruction word (VLIW)

  • Exposed pipeline

    1. DRAM-

    2. instruction- data- ( 1 ) SRAM-, .. scratchpad Tightly Coupled Memory,

    3. – , TCM,

  • (branch target buffer, BTB) DSP ( , , )

  • (.. , )
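Several of the features listed above (SIMD, zero-overhead loops, VLIW) pay off mostly in multiply-accumulate loops. As an illustration that is not taken from the original text, here is a plain C FIR filter, a typical kernel of this kind. The function name and the 16-bit sample format are assumptions chosen for the sketch, and the coefficients are applied in correlation order (no reversal).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[n] = sum over k of h[k] * x[n + k], for n in [0, n_out).
 * x must hold at least n_out + taps - 1 samples.
 * Hypothetical reference kernel, not a vendor routine. */
void fir_int16(const int16_t *x, const int16_t *h, int32_t *y,
               size_t n_out, size_t taps)
{
    assert(taps > 0);
    for (size_t n = 0; n < n_out; ++n) {
        int32_t acc = 0;                        /* accumulator wider than the samples */
        for (size_t k = 0; k < taps; ++k)
            acc += (int32_t)x[n + k] * h[k];    /* one multiply-accumulate per tap */
        y[n] = acc;
    }
}
```

The inner loop has a fixed trip count, no branches besides the loop itself, and one multiply-accumulate per iteration, which is the shape that hardware loops and wide MAC units are meant to run at one iteration per cycle.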

, DSP .

Texas Instruments:



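   LDW  .D1T1 *A4++, A3
|| LDW  .D2T2 *B4++, B5  ; issued in parallel with the previous load
   NOP        3          ; wait for the loads to complete
   BDEC .S2   L2, B0     ; branch and decrement (compare + decrement + goto)
   ADD  .L1X  B5, A3, A3 ; branch delay slot 1
   STW  .D1T1 A3, *A5++  ; branch delay slot 2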

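  • BDEC: branch and decrement in a single instruction; it combines the counter decrement, the comparison with 0, and the jump

  • ||: the instruction is issued in parallel with the previous one

  • NOP: explicit idle cycles (3 here) that wait out the load delay slots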
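For readers who do not read TI assembly, the listing above corresponds roughly to the C loop below. This is a sketch: the function and parameter names are invented, `count` stands in for register B0, and the exact trip-count convention of BDEC is glossed over.

```c
#include <assert.h>
#include <stdint.h>

/* Element-wise sum of two word arrays, mirroring the assembly loop body.
 * Hypothetical C rendering; register names appear only in the comments. */
void add_words(const int32_t *a, const int32_t *b, int32_t *dst, int count)
{
    assert(count >= 0);
    while (count-- > 0) {       /* BDEC: test the counter, decrement, jump */
        int32_t lhs = *a++;     /* LDW *A4++, A3 */
        int32_t rhs = *b++;     /* LDW *B4++, B5 (same cycle as the first load) */
        *dst++ = lhs + rhs;     /* ADD B5, A3, A3 then STW A3, *A5++ */
    }
}
```

What the C version hides is the timing: in the assembly, the programmer (or compiler) has to place the NOP and the instructions after the branch explicitly, because the pipeline is exposed.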


DSP . , ( , ), , . , , , DSP.

, DSP : LTO (link-time optimization) PGO/FDO (profile-guided/feedback-driven optimization). , restrict/noalias, .

- (intrinsic), .. , , . , ( ).

, , , - : - - .

, , : , . ( Nvidia Nsight ). .

, DSP, .

DSP (, , , , IDE) :

  • ( )

  • host- intrinsic-, ( , - – “”)

  • (“linear assembler”), ( , )

  • , .. source-level ( )

  • boilerplate- ( RPC- DSP )


DSP . – –O2 –O0 - (.. out-of-order ). DSP . , , performance-critical .

DSP open-source :

  • Open64 ( Ceva Cadence/Tensilica)

  • GCC (Texas Instruments Qualcomm)

  • LLVM ( Ceva Qualcomm, Cadence/Tensilica)

C++, , OpenCL, OpenMP/OpenACC Halide.

DSP , .. (, DSP Hexagon AMDGPU , AArch64). , VLIW : , NOP’ . Intel Itanium, DSP VLIW, , “-” (“heroic compiler”, , ). , DSP ( Partitioned Boolean Quadratic Programming).

, , . , :


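// Trip-count hint: the loop runs at least min and at most max times,
// and the count is always a multiple of `multiple`
#pragma MUST_ITERATE(min, max, multiple)
for (i = x; i < y; ++i) { /* loop body */ }

// foo makes no assignments to global variables
#pragma FUNC_NO_GLOBAL_ASG(foo)
extern void foo();

// bar makes no assignments through pointers
#pragma FUNC_NO_IND_ASG(bar)
extern void bar();

// i is always a multiple of 8
_nassert((i & 7) == 0);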
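The pragmas and `_nassert` above are specific to the TI compiler. A similar effect can be approximated on GCC and Clang with `__builtin_unreachable()`. The `ASSUME` macro and the function below are hypothetical names for this sketch; in debug builds the macro falls back to a checked `assert`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: check the condition in debug builds, and promise it
 * to the optimizer in release builds (GCC/Clang builtin). */
#if defined(NDEBUG) && (defined(__GNUC__) || defined(__clang__))
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define ASSUME(cond) assert(cond)
#endif

/* Sums n ints; the caller guarantees that n is a positive multiple of 8. */
int sum_multiple_of_8(const int *p, size_t n)
{
    ASSUME(n > 0 && (n & 7) == 0);   /* same kind of promise as MUST_ITERATE */
    int s = 0;
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}
```

With such a promise the compiler may skip the zero-trip check and the scalar remainder loop when it unrolls or vectorizes. Breaking the promise in a release build is undefined behavior, so the hint belongs only where the caller really guarantees it.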

, , ( __builtin_prefetch

GCC) , ( , -).

, ( ) , restrict . :

( -Msafeptr=all PGI).
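As a small illustration of the `restrict` qualifier mentioned above, the following C99 function promises the compiler that its two arrays do not overlap. The function name and signature are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = k * src[i]. The restrict qualifiers state that dst and src never
 * alias, so loads from src may be reordered freely around stores to dst. */
void scale_into(float *restrict dst, const float *restrict src,
                float k, size_t n)
{
    assert(n == 0 || dst != src);   /* cheap sanity check of the no-alias promise */
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}
```

Without `restrict`, the compiler has to assume that a store to `dst[i]` might change a later `src[j]`, which blocks vectorization and software pipelining unless it adds a runtime overlap check.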

, . :

int32x4_t acc = 0;
int *p = ..., coeff = ...;
for (i = 0; i < N; i += 4) {
    int32x4_t x = vload(&p[i]);
    acc = vmac(acc, coeff, x);   // multiply-accumulate, per lane: acc += coeff * x
}

intrinsic- . ( , .. ).
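Code written with intrinsics can be run on a host machine if each intrinsic gets a plain C stand-in. The sketch below emulates the `vload` and `vmac` calls from the fragment above; the struct layout, the lane count of 4, and the `scaled_sum` wrapper are assumptions for illustration, not any vendor's actual API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of a 4-lane vector register. */
typedef struct { int32_t lane[4]; } int32x4_t;

/* Load four consecutive 32-bit values into a vector. */
static int32x4_t vload(const int32_t *p)
{
    int32x4_t v;
    for (int i = 0; i < 4; ++i)
        v.lane[i] = p[i];
    return v;
}

/* Per-lane multiply-accumulate: acc[i] += coeff * x[i]. */
static int32x4_t vmac(int32x4_t acc, int32_t coeff, int32x4_t x)
{
    for (int i = 0; i < 4; ++i)
        acc.lane[i] += coeff * x.lane[i];
    return acc;
}

/* Returns coeff * (p[0] + ... + p[n-1]); n must be a multiple of 4. */
int32_t scaled_sum(const int32_t *p, int n, int32_t coeff)
{
    assert(n >= 0 && n % 4 == 0);
    int32x4_t acc = {{0, 0, 0, 0}};
    for (int i = 0; i < n; i += 4)
        acc = vmac(acc, coeff, vload(&p[i]));
    return acc.lane[0] + acc.lane[1] + acc.lane[2] + acc.lane[3];
}
```

The same source can then be compiled against the real intrinsics for the target and against this model for tests on the host, as long as the model reproduces the target's overflow and rounding behavior.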

DSP ( , , , .) :

  • (unrolling) (software pipelining)

  • (if-conversion)

  • (induction variable renaming)


, .


for (i = 0; i < N; ++i)

  a[i] = b[i] * 3;  


ld (a0)+, a2

nop 3

mul a2, a3, a4

st  a4, (a1)+ 


tmp2 = b[0] * 3; tmp1 = b[1];
for (i = 0; i < N - 2; ++i) {
  a[i] = tmp2;
  tmp2 = tmp1 * 3;
  tmp1 = b[i + 2];
}
a[N - 2] = tmp2; a[N - 1] = tmp1 * 3;



st a4, (a1)+ || ld (a0)+, a2 || mul a2, a3, a4 
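The transformation above can be checked in plain C by running the original and the pipelined loop side by side. The function names are invented for this sketch, and the pipelined version assumes at least two elements.

```c
#include <assert.h>

/* Original loop: load, multiply and store all belong to one iteration. */
void triple_naive(int *a, const int *b, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * 3;
}

/* Software-pipelined loop: each pass stores the result of iteration i,
 * multiplies for iteration i + 1 and loads for iteration i + 2. */
void triple_pipelined(int *a, const int *b, int n)
{
    assert(n >= 2);
    int tmp2 = b[0] * 3;           /* prologue: multiply for iteration 0 */
    int tmp1 = b[1];               /* prologue: load for iteration 1 */
    for (int i = 0; i < n - 2; ++i) {
        a[i] = tmp2;               /* store    (iteration i)     */
        tmp2 = tmp1 * 3;           /* multiply (iteration i + 1) */
        tmp1 = b[i + 2];           /* load     (iteration i + 2) */
    }
    a[n - 2] = tmp2;               /* epilogue: drain the pipeline */
    a[n - 1] = tmp1 * 3;
}
```

The three statements in the steady-state body do not depend on each other within one pass, which is what lets a VLIW machine issue them as a single instruction bundle.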


, .

for (i = ...) {
  if (p[i])
    x[i] = a * y;
  else
    x[i] = b * z;
}


for (i = ...) {
  bool predicate = p[i];
  tmp1 = predicate ? a * y : 0;
  tmp2 = predicate ? 0 : b * z;
  x[i] = predicate ? tmp1 : tmp2;
}


, ,

for (i = ...) {
  bool predicate = p[i];
  tmp1 = a * y;
  tmp2 = b * z;
  x[i] = predicate ? tmp1 : tmp2;
}


cmp a7, 0, p1

mul a0, a1, a2, p1

mul a3, a4, a2, !p1

, , p1

. “” (- ), ( ).
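A runnable C version of the if-conversion above makes it easy to confirm that the branch-free loop produces the same results as the branchy one. The function names and the flat parameter list are assumptions made for this sketch.

```c
#include <assert.h>

/* Original form: a data-dependent branch inside the loop body. */
void select_branchy(int *x, const int *p, int n, int a, int y, int b, int z)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        if (p[i])
            x[i] = a * y;
        else
            x[i] = b * z;
    }
}

/* If-converted form: both sides are computed unconditionally and the
 * predicate only selects which result is stored. */
void select_if_converted(int *x, const int *p, int n, int a, int y, int b, int z)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        int predicate = (p[i] != 0);
        int tmp1 = a * y;
        int tmp2 = b * z;
        x[i] = predicate ? tmp1 : tmp2;
    }
}
```

Computing both sides is only safe because neither multiplication has side effects. A side that could fault or write to memory has to stay guarded by the predicate.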

, , .

// Unroll 2
for (i = ...) {
  z[i] = a * x[i];
  ++i;
  z[i] = a * x[i];
  ++i;
}

// Dependencies removed
for (i1, i2 = ...) {
  z[i1] = a * x[i1];
  i1 += 2;
  z[i2] = a * x[i2];
  i2 += 2;
}


, i1
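The renaming above can be written out as two complete C functions and compared. The names are hypothetical and an even element count is assumed to keep the sketch short.

```c
#include <assert.h>

/* Unrolled by 2 with one induction variable: the second statement
 * cannot start before the first increment of i has finished. */
void scale_unrolled(int *z, const int *x, int a, int n)
{
    assert(n >= 0 && n % 2 == 0);
    for (int i = 0; i < n; ) {
        z[i] = a * x[i];
        ++i;
        z[i] = a * x[i];
        ++i;
    }
}

/* Same loop after induction variable renaming: i1 walks the even indices
 * and i2 the odd ones, so the two halves of the body are independent. */
void scale_renamed(int *z, const int *x, int a, int n)
{
    assert(n >= 0 && n % 2 == 0);
    for (int i1 = 0, i2 = 1; i1 < n; i1 += 2, i2 += 2) {
        z[i1] = a * x[i1];
        z[i2] = a * x[i2];
    }
}
```

After renaming, the only dependency that remains is each counter on its own previous value, so both multiply-store pairs can be scheduled in the same cycle.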


, .


  • Fifty years of signal processing

  • DSPs for Mobile Communication



  • J.A. Fisher et al. “Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools” (J.A. Fisher is the inventor of VLIW)

  • “Mill computing” YouTube ( )



DSP ( E. Belaish, CEVA)

, Senior Engineer, System-on-Chip SW Team, Samsung

Samsung: , .

Samsung Exynos DSP NPU. , , . Samsung state-of-the-art , . : -. .


