DSP-processors: purpose and features

source: https://innovas-services.fr/solving-business-problems/

: (general-purpose, x86) , , , . , : , Digital Signal Processors DSP.





DSP , ( ) , . ARM-, DSP.





DSP, , , .





DSP 1970- . - , , ( - ). , ( MOSFET) (, ) , .. , , . , (time-to-market), , . .





Fig. 1 DSP's first major success: Speak & Spell (Texas Instruments, 1978)
Fig. 2 Since the advent of the GSM standard, DSPs have been an indispensable component of mobile networks.
Fig. 3 Image processing in cameras (debayering, noise removal, filtering) is also done on a DSP (source: https://snapshot.canon-asia.com/india/article/en/5-things-made-possible-with-digic-image-processor)

- DSP . - , .. , .





DSP

DSP , Intel Xeon Cortex-A, ? Intel.





Fig. 4 Intel Skylake (source: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client))

, , , (out-of-order speculative execution) (scheduling). , “” , .. , , 1%:





While a simple arithmetic operation requires around 0.5–20 pJ, modern cores spend about 2000 pJ to schedule it.





Conventional multicore processors consume 157–707 times more energy than customized hardware designs.





(both quotes are from “Rise and Fall of Dark Silicon”).
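As a rough check of the figures quoted above, the share of energy that goes into the arithmetic itself can be estimated as follows. This assumes, purely for illustration, that the cost of one instruction is the operation energy plus the scheduling energy and nothing else:

```latex
\[
\frac{E_{\text{op}}}{E_{\text{op}} + E_{\text{sched}}}
\approx \frac{0.5\ \text{pJ}}{2000.5\ \text{pJ}} \approx 0.025\%
\quad\text{to}\quad
\frac{20\ \text{pJ}}{2020\ \text{pJ}} \approx 0.99\%
\]
```

Under that assumption the useful work stays below about 1% of the energy spent per instruction.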





, Intel DSP Texas Instruments ( Skylake Xeon Platinum 8180M TMS320C6713BZDP300):





 





CPU (Intel)





DSP (TI)









2.5





500









28





1









560 GIPS





1.8 GIPS









205





1





Out-of-order

















$13K (+ )





$35 ( )





Target applications









-





-





-





/ /





0.097 GIPS//





1.7 GIPS//





( 17 Intel)





/ / / $





0.0075 MIPS///$





0.051 GIPS///$





( 7000 Intel)





 





, DSP , 1 4 (!) . : DSP : , , .. ( , ) .





DSP

DSP. Jennifer Eyre, BDTI, “ DSP , ” (“Architecture of DSP is molded by algorithms”, “Evolution of DSP Processors”). :





  • (ILP, Instruction Level Parallelism)





  • (, , )





  • ,





, .





ILP :





  • (SIMD, Single Instruction Multiple Data)





  • (CISC, Complex Instruction Set Computer):





    1. ( , , , )





    2. ( )





    3. ( , - -, .)





    4. - ( , , - )





    5. ( IP- Ceva Tensilica)





  • ( scatter/gather)





  • , :





    1. ( )





    2. ( )









    3. (zero-overhead loops)





  • - ( , , QR-, )





  • (DMA), 2D/3D-













  • ( 20- 40- )





  • (in-order) , (speculation) (out-of-order)





  • (VLIW)





  • (exposed pipeline)









    1. DRAM-





    2. instruction- data- ( 1 ) SRAM-, .. scratchpad Tightly Coupled Memory,





    3. – , TCM,





  • (branch target buffer, BTB) DSP ( , , )





  • (.. , )





, DSP .





Texas Instruments:





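; Element-wise sum of two arrays
L2:
   LDW  .D1T1 *A4++, A3
|| LDW  .D2T2 *B4++, B5  ; two loads issued in parallel
   NOP        3          ; wait for the loaded values to arrive
   BDEC .S2   L2, B0     ; loop branch (decrement + compare + goto)
   ADD  .L1X  B5, A3, A3 ; executes in branch delay slot 1
   STW  .D1T1 A3, *A5++  ; executes in branch delay slot 2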











  • BDEC



    , , 0





  • ||







  • NOP
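For orientation, the assembly loop above corresponds roughly to the following C function. This is a sketch only: the names vec_add, a, b, out and n are illustrative and do not come from the listing, and the comments map each statement to the instruction that appears to implement it.

```c
#include <stddef.h>

/* Scalar C equivalent of the assembly loop: out[i] = a[i] + b[i].
 * All identifiers here are hypothetical. */
void vec_add(const int *restrict a, const int *restrict b,
             int *restrict out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int x = a[i];      /* LDW *A4++, A3 */
        int y = b[i];      /* LDW *B4++, B5 (issued in parallel) */
        out[i] = x + y;    /* ADD B5, A3, A3, then STW A3, *A5++ */
    }                      /* BDEC: decrement the counter and branch */
}
```

In the assembly version the programmer (or compiler) also has to place the instructions in the right cycles by hand, which the C source hides.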











DSP

DSP . , ( , ), , . , , , DSP.





, DSP : LTO (link-time optimization) PGO/FDO (profile-guided/feedback-driven optimization). , restrict/noalias, .





- (intrinsic), .. , , . , ( ).





, , , - : - - .





, , : , . ( Nvidia Nsight ). .





, DSP, .





DSP (, , , , IDE) :





  • ( )





  • host- intrinsic-, ( , - – “”)





  • (“linear assembler”), ( , )





  • , .. source-level ( )





  • boilerplate- ( RPC- DSP )





DSP

DSP . – –O2 –O0 - (.. out-of-order ). DSP . , , performance-critical .





DSP open-source :





  • Open64 ( Ceva Cadence/Tensilica)





  • GCC (Texas Instruments Qualcomm)





  • LLVM ( Ceva Qualcomm, Cadence/Tensilica)





C++, , OpenCL, OpenMP/OpenACC Halide.





DSP , .. (, DSP Hexagon AMDGPU , AArch64). , VLIW : , NOP’ . Intel Itanium, DSP VLIW, , “-” (“heroic compiler”, , ). , DSP ( Partitioned Boolean Quadratic Programming).





, , . , :





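// Tell the compiler the bounds of the loop trip count
// (minimum, maximum, and the factor it is a multiple of)
#pragma MUST_ITERATE(min, max, multiple)
for (i = x; i < y; ++i)
  ...

// foo does not assign to global variables
#pragma FUNC_NO_GLOBAL_ASG(foo)
extern void foo();

// bar does not assign through pointers
#pragma FUNC_NO_IND_ASG(bar)
extern void bar();

// i is a multiple of 8
_nassert((i & 7) == 0);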
      
      



, , ( __builtin_prefetch GCC) , ( , -).
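A minimal sketch of how GCC's __builtin_prefetch hint is typically used in a loop. The function name, the PREFETCH_AHEAD distance and the locality argument are illustrative assumptions; useful values depend on the target and have to be measured.

```c
#include <stddef.h>

/* Illustrative prefetch distance, in elements (an assumption, not a
 * recommended value). */
#define PREFETCH_AHEAD 16

/* Sums an array while hinting the cache about data that will be
 * needed a few iterations later. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        if (i + PREFETCH_AHEAD < n)
            /* arguments: address, 0 = read access, 1 = low temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
#endif
        total += data[i];
    }
    return total;
}
```

The hint never changes the result; it only affects when the data is brought into the cache.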





, ( ) , restrict . :

















( -Msafeptr=all PGI).
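A minimal sketch of the C99 restrict qualifier; the function and parameter names are hypothetical. Without restrict the compiler has to assume that a store to dst may change *scale or later elements of src, so it reloads *scale on every iteration and cannot freely reorder loads and stores.

```c
#include <stddef.h>

/* restrict is the caller's promise that dst, src and scale do not
 * overlap, which lets the compiler keep *scale in a register and
 * schedule the loads ahead of the stores. */
void scale_array(int *restrict dst, const int *restrict src,
                 const int *restrict scale, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * *scale;
}
```

Passing overlapping buffers to such a function is undefined behavior, so the qualifier should only be added where the no-overlap guarantee actually holds.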





, . :





int32x4_t acc = 0;
int *p = ..., coeff = ...;
for (i = 0; i < N; i += 4) {
  int32x4_t x = vload(&p[i]);
  acc = vmac(acc, coeff, x);
  // equivalent to acc += coeff * x; on each of the four lanes
}
      
      



intrinsic- . ( , .. ).
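To make the intrinsic example above concrete, here is a plain-C reference model of it. The type int32x4_t and the functions vload and vmac are stand-ins written for this sketch; a real DSP toolchain ships its own vector types and intrinsic names, and scaled_sum is a hypothetical wrapper.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 4-lane vector register of 32-bit integers. */
typedef struct { int32_t lane[4]; } int32x4_t;

/* Loads four consecutive elements into a vector. */
static int32x4_t vload(const int32_t *p)
{
    int32x4_t v;
    for (int k = 0; k < 4; ++k)
        v.lane[k] = p[k];
    return v;
}

/* Multiply-accumulate: acc.lane[k] += coeff * x.lane[k] for each lane. */
static int32x4_t vmac(int32x4_t acc, int32_t coeff, int32x4_t x)
{
    for (int k = 0; k < 4; ++k)
        acc.lane[k] += coeff * x.lane[k];
    return acc;
}

/* Accumulates coeff * p[i] into four per-lane partial sums, then adds
 * the lanes together. n must be a multiple of 4. */
int32_t scaled_sum(const int32_t *p, size_t n, int32_t coeff)
{
    int32x4_t acc = {{0, 0, 0, 0}};
    for (size_t i = 0; i < n; i += 4)
        acc = vmac(acc, coeff, vload(&p[i]));
    return acc.lane[0] + acc.lane[1] + acc.lane[2] + acc.lane[3];
}
```

On real hardware each vload and vmac call would map to a single instruction, and the inner lane loops would not exist.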





DSP ( , , , .) :





  • (unrolling) (software pipelining)





  • (if-conversion)





  • (induction variable renaming)





.





, .





// Original loop
for (i = 0; i < N; ++i)
  a[i] = b[i] * 3;

// Loop body in assembly
ld (a0)+, a2
nop 3
mul a2, a3, a4
st  a4, (a1)+
      
      







// Software-pipelined loop
tmp2 = b[0] * 3; tmp1 = b[1];
for (i = 0; i < N - 2; ++i) {
  a[i] = tmp2;
  tmp2 = tmp1 * 3;
  tmp1 = b[i + 2];
}
a[N - 2] = tmp2; a[N - 1] = tmp1 * 3;

// Loop body in assembly
st a4, (a1)+ || ld (a0)+, a2 || mul a2, a3, a4
      
      



.





, .





for (i = ...) {
  if (p[i])
    x[i] = a * y;
  else
    x[i] = b * z;
}
      
      







for (i = ...) {
  bool predicate = p[i];
  tmp1 = predicate ? a * y : 0;
  tmp2 = !predicate ? b * z : 0;
  x[i] = predicate ? tmp1 : tmp2;
}
      
      



, ,





for (i = ...) {
  bool predicate = p[i];
  tmp1 = a * y;
  tmp2 = b * z;
  x[i] = predicate ? tmp1 : tmp2;
}
      
      







cmp a7, 0, p1
mul a0, a1, a2, p1
mul a3, a4, a2, !p1







, , p1 . “” (- ), ( ).





, , .





// Unroll 2
for (i = ...) {
  z[i] = a * x[i];
  ++i;
  z[i] = a * x[i];
  ++i;
}
      
      







// Dependencies removed
for (i1, i2 = ...) {
  z[i1] = a * x[i1];
  i1 += 2;
  z[i2] = a * x[i2];
  i2 += 2;
}
      
      



, i1 i2 , .





:





  • Fifty years of signal processing





  • DSPs for Mobile Communication





:





VLIW





  • J.A. Fisher et al. “Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools” (J.A. Fisher is the inventor of VLIW)





  • “Mill computing” YouTube ( )





VLIW





DSP





DSP ( E. Belaish, CEVA)





, Senior Engineer, System-on-Chip SW Team, Samsung





Samsung: , .





Samsung Exynos DSP NPU. , , . Samsung state-of-the-art , . : -. .












