DSP processors: purpose and features
: (general-purpose, x86) , , , . , : , Digital Signal Processors DSP.
DSP , ( ) , . ARM-, DSP.
DSP, , , .
DSP 1970- . - , , ( - ). , ( MOSFET) (, ) , .. , , . , (time-to-market), , . .
- DSP . - , .. , .
DSP
DSP , Intel Xeon Cortex-A, ? Intel.
, , , (out-of-order speculative execution) (scheduling). , “” , .. , , 1%:
While a simple arithmetic operation requires around 0.5–20 pJ, modern cores spend about 2000 pJ to schedule it.
Conventional multicore processors consume 157–707 times more energy than customized hardware designs.
(Source: “Rise and Fall of Dark Silicon”.)
For example, compare an Intel CPU with a DSP from Texas Instruments (Skylake Xeon Platinum 8180M vs. TMS320C6713BZDP300):
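| | CPU (Intel) | DSP (TI) |
| Clock frequency | 2.5 GHz | 500 MHz |
| Cores | 28 | 1 |
| Performance | 560 GIPS | 1.8 GIPS |
| Power | 205 W | 1 W |
| Out-of-order | Yes | No |
| Price | $13K | $35 |
| Target applications | | - - - |
| Performance / W / core | 0.097 GIPS/W/core | 1.7 GIPS/W/core (17x Intel) |
| Performance / W / core / $ | 0.0075 MIPS/W/core/$ | 0.051 GIPS/W/core/$ (7000x Intel) |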
, DSP , 1 4 (!) . : DSP : , , .. ( , ) .
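The derived rows of the comparison can be recomputed from the raw ones. The sketch below assumes the raw figures are GIPS, watts, core count and US dollars; the struct and function names are made up for illustration. With the DSP inputs as listed (1.8 GIPS, 1 W) the per-watt-per-core figure comes out at 1.8, close to the 1.7 quoted in the comparison, and the per-dollar ratio lands near 7000.

```c
#include <assert.h>

/* Raw figures from the comparison; units are assumed to be
   GIPS, watts, cores and US dollars. */
typedef struct {
    double gips;   /* performance, GIPS */
    double watts;  /* power */
    double cores;  /* number of cores */
    double usd;    /* price */
} chip_t;

/* GIPS per watt per core. */
double perf_per_watt_core(chip_t c) {
    assert(c.watts > 0 && c.cores > 0);
    return c.gips / c.watts / c.cores;
}

/* GIPS per watt per core per dollar. */
double perf_per_watt_core_usd(chip_t c) {
    assert(c.usd > 0);
    return perf_per_watt_core(c) / c.usd;
}

static const chip_t XEON_8180M  = { 560.0, 205.0, 28.0, 13000.0 };
static const chip_t TMS320C6713 = { 1.8, 1.0, 1.0, 35.0 };
```

Dividing the two `perf_per_watt_core_usd` results gives the price-adjusted advantage of the DSP in this particular comparison.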
DSP
DSP. Jennifer Eyre, BDTI, “ DSP , ” (“Architecture of DSP is molded by algorithms”, “Evolution of DSP Processors”). :
Instruction-level parallelism (ILP)
(, , )
,
, .
ILP :
SIMD (Single Instruction, Multiple Data)
Complex instructions (CISC, Complex Instruction Set Computer):
( , , , )
( )
( , - -, .)
- ( , , - )
( IP- Ceva Tensilica)
( scatter/gather)
, :
( )
( )
Zero-overhead loops
- ( , , QR-, )
(DMA), 2D/3D-
( 20- 40- )
In-order execution, with no speculation and no out-of-order execution
Very long instruction word (VLIW)
Exposed pipeline
DRAM-
instruction- data- ( 1 ) SRAM-, .. scratchpad Tightly Coupled Memory,
– , TCM,
(branch target buffer, BTB) DSP ( , , )
(.. , )
, DSP .
Texas Instruments:
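; Element-wise sum of two word arrays
L2:
    LDW .D1T1 *A4++, A3      ; load a word from the first array
 || LDW .D2T2 *B4++, B5      ; load a word from the second array, in parallel
    NOP 3                    ; wait out the load delay slots
    BDEC .S2 L2, B0          ; branch-and-decrement (compare + decrement + goto)
    ADD .L1X B5, A3, A3      ; branch delay slot 1
    STW .D1T1 A3, *A5++      ; branch delay slot 2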
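Here BDEC is a single instruction that compares the loop counter with 0, decrements it and branches; || marks an instruction that issues in parallel with the previous one; NOP n idles for n cycles.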
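For readers less used to TI assembly, here is a rough C reading of the same loop. It is a sketch under assumptions: the function name and parameter names are invented, and `n` stands in for the trip count that the listing keeps in B0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise sum of two word arrays, one element per iteration.
   The comments map each statement to the instruction it mirrors. */
void add_words(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {   /* BDEC: compare + decrement + goto */
        int32_t x = a[i];              /* LDW *A4++, A3 */
        int32_t y = b[i];              /* LDW *B4++, B5 (issued in parallel) */
        /* ADD B5, A3, A3: unsigned arithmetic gives wrap-around
           without signed-overflow undefined behavior. */
        int32_t sum = (int32_t)((uint32_t)x + (uint32_t)y);
        out[i] = sum;                  /* STW A3, *A5++ */
    }
}
```

What the C version hides is the scheduling: in the assembly the add and the store sit in the delay slots of the branch, so no cycle is spent on loop control alone.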
DSP
DSP . , ( , ), , . , , , DSP.
, DSP : LTO (link-time optimization) PGO/FDO (profile-guided/feedback-driven optimization). , restrict/noalias, .
- (intrinsic), .. , , . , ( ).
, , , - : - - .
, , : , . ( Nvidia Nsight ). .
, DSP, .
DSP (, , , , IDE) :
( )
host- intrinsic-, ( , - – “”)
(“linear assembler”), ( , )
, .. source-level ( )
boilerplate- ( RPC- DSP )
DSP
DSP . – –O2 –O0 - (.. out-of-order ). DSP . , , performance-critical .
DSP open-source :
Open64 ( Ceva Cadence/Tensilica)
GCC (Texas Instruments Qualcomm)
LLVM ( Ceva Qualcomm, Cadence/Tensilica)
C++, , OpenCL, OpenMP/OpenACC Halide.
DSP , .. (, DSP Hexagon AMDGPU , AArch64). , VLIW : , NOP’ . Intel Itanium, DSP VLIW, , “-” (“heroic compiler”, , ). , DSP ( Partitioned Boolean Quadratic Programming).
, , . , :
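// Loop trip count information:
// minimum, maximum, and a factor the trip count is a multiple of
#pragma MUST_ITERATE(min, max, multiple)
for (i = x; i < y; ++i)
    ...
// foo() makes no assignments to global variables
#pragma FUNC_NO_GLOBAL_ASG(foo)
extern void foo();
// bar() makes no assignments through pointers
#pragma FUNC_NO_IND_ASG(bar)
extern void bar();
// i is a multiple of 8
_nassert((i & 7) == 0);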
, , ( __builtin_prefetch
GCC) , ( , -).
, ( ) , restrict . :
( -Msafeptr=all PGI).
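A minimal illustration of the `restrict` qualifier in standard C99 follows. The function name and signature are invented for the example; the point is only that the qualifier is a promise from the programmer that the two pointers never refer to overlapping memory.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] += k * src[i].
   'restrict' promises the compiler that dst and src do not overlap,
   so it may load several src elements before storing to dst and
   reorder or vectorize the loop freely. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

Calling such a function with overlapping buffers is undefined behavior, which is why a blanket compiler switch that assumes no aliasing everywhere is riskier than annotating individual functions.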
, . :
int32x4_t acc = 0;
int *p = ..., coeff = ...;
for (i = 0; i < N; i += 4) {
int32x4_t x = vload(&p[i]);
acc = vmac(acc, coeff, x);
// acc += coeff * x; , ..
}
intrinsic- . ( , .. ).
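To make the intrinsic-based loop above concrete on an ordinary host, the sketch below emulates a 4-lane vector type in portable C. Everything here is hypothetical: `vec4_t`, `vec4_load` and `vec4_mac` are stand-ins for a vendor's `int32x4_t`, `vload` and `vmac`, which on a real DSP would each map to a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-lane vector of 32-bit integers. */
typedef struct { int32_t lane[4]; } vec4_t;

/* Stand-in for a vector load intrinsic. */
vec4_t vec4_load(const int32_t *p) {
    vec4_t v;
    for (int k = 0; k < 4; ++k)
        v.lane[k] = p[k];
    return v;
}

/* Stand-in for a multiply-accumulate intrinsic: acc += coeff * x per lane. */
vec4_t vec4_mac(vec4_t acc, int32_t coeff, vec4_t x) {
    for (int k = 0; k < 4; ++k)
        acc.lane[k] += coeff * x.lane[k];
    return acc;
}

/* Sum of coeff * p[i] over n elements; n must be a multiple of 4. */
int32_t scaled_sum(const int32_t *p, int32_t coeff, size_t n) {
    assert(n % 4 == 0);
    vec4_t acc = { {0, 0, 0, 0} };
    for (size_t i = 0; i < n; i += 4)
        acc = vec4_mac(acc, coeff, vec4_load(&p[i]));
    /* Horizontal reduction of the four partial sums. */
    return acc.lane[0] + acc.lane[1] + acc.lane[2] + acc.lane[3];
}
```

A host-side emulation layer of this kind is also one way to test intrinsic-heavy code without the target hardware.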
DSP ( , , , .) :
Loop unrolling and software pipelining
If-conversion
Induction variable renaming
.
, .
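// Original loop
for (i = 0; i < N; ++i)
    a[i] = b[i] * 3;
// Straightforward code for one iteration
ld (a0)+, a2
nop 3
mul a2, a3, a4
st a4, (a1)+
// Software-pipelined loop
tmp2 = b[0] * 3; tmp1 = b[1];
for (i = 0; i < N - 2; ++i) {
    a[i] = tmp2;
    tmp2 = tmp1 * 3;
    tmp1 = b[i + 2];
}
a[N - 2] = tmp2; a[N - 1] = tmp1 * 3;
// Kernel of the pipelined loop: store, load and multiply issue in parallel
st a4, (a1)+ || ld (a0)+, a2 || mul a2, a3, a4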
.
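The software-pipelined form above can be checked against the plain loop on any host. The following sketch wraps both in functions (the names are invented here) and adds the guard for short arrays that the prologue and epilogue need.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: a[i] = b[i] * 3. */
void scale3_plain(int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] * 3;
}

/* Software-pipelined version: each iteration stores element i,
   multiplies element i + 1 and loads element i + 2. */
void scale3_pipelined(int *a, const int *b, size_t n) {
    assert(a != NULL && b != NULL);
    if (n < 2) {                 /* too short to fill the pipeline */
        if (n == 1)
            a[0] = b[0] * 3;
        return;
    }
    int tmp2 = b[0] * 3;         /* prologue: multiply stage primed */
    int tmp1 = b[1];             /* prologue: load stage primed */
    for (size_t i = 0; i + 2 < n; ++i) {
        a[i] = tmp2;             /* store stage */
        tmp2 = tmp1 * 3;         /* multiply stage */
        tmp1 = b[i + 2];         /* load stage */
    }
    a[n - 2] = tmp2;             /* epilogue: drain the pipeline */
    a[n - 1] = tmp1 * 3;
}
```

The three statements in the loop body have no dependencies on each other within one iteration, which is what lets a VLIW machine issue them in the same cycle.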
, .
for (i = ...) {
if (p[i])
x[i] = a * y;
else
x[i] = b * z;
}
for (i = ...) {
    bool predicate = p[i];
    tmp1 = predicate ? a * y : 0;
    tmp2 = predicate ? 0 : b * z;
    x[i] = predicate ? tmp1 : tmp2;
}
, ,
for (i = ...) {
bool predicate = p[i];
tmp1 = a * y;
tmp2 = b * z;
x[i] = predicate ? tmp1 : tmp2;
}
cmp a7, 0, p1
mul a0, a1, a2, p1
mul a3, a4, a2, !p1
, , p1
. “” (- ), ( ).
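As a host-side check of the if-conversion shown above, the sketch below puts the branchy loop and the converted loop side by side. Function and parameter names are invented for the example; the converted version computes both products unconditionally and keeps only a select, which is the shape a compiler can map onto predicated instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: a branch inside the loop body. */
void select_branchy(int *x, const int *p, int a, int y, int b, int z,
                    size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (p[i])
            x[i] = a * y;
        else
            x[i] = b * z;
    }
}

/* If-converted form: both sides are evaluated, then one is selected. */
void select_converted(int *x, const int *p, int a, int y, int b, int z,
                      size_t n) {
    assert(x != NULL && p != NULL);
    for (size_t i = 0; i < n; ++i) {
        int predicate = (p[i] != 0);
        int tmp1 = a * y;                 /* "then" value */
        int tmp2 = b * z;                 /* "else" value */
        x[i] = predicate ? tmp1 : tmp2;   /* select, no control flow */
    }
}
```

The transformation is only safe when evaluating both sides has no side effects and cannot fault, which holds for plain integer multiplies like these.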
, , .
// Unroll 2
for (i = ...) {
z[i] = a * x[i];
++i;
z[i] = a * x[i];
++i;
}
// Dependencies removed
for (i1, i2 = ...) {
z[i1] = a * x[i1];
i1 += 2;
z[i2] = a * x[i2];
i2 += 2;
}
, i1
i2
, .
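The renaming above can be written out as compilable C. In this sketch (function names are invented) the first version shares one counter between the two unrolled statements, so the second statement has to wait for the increment; the second version gives each statement its own counter. A tail step handles odd lengths, which the pseudocode leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled by 2 with a single counter: the second statement
   depends on the ++i that precedes it. */
void scale_unrolled(int *z, const int *x, int a, size_t n) {
    size_t i = 0;
    while (i + 1 < n) {
        z[i] = a * x[i];
        ++i;
        z[i] = a * x[i];
        ++i;
    }
    if (i < n)                    /* odd-length tail */
        z[i] = a * x[i];
}

/* Same loop after induction variable renaming: i1 walks the even
   indices, i2 the odd ones, and neither depends on the other. */
void scale_renamed(int *z, const int *x, int a, size_t n) {
    assert(z != NULL && x != NULL);
    for (size_t i1 = 0, i2 = 1; i2 < n; i1 += 2, i2 += 2) {
        z[i1] = a * x[i1];
        z[i2] = a * x[i2];
    }
    if (n & 1)                    /* odd-length tail */
        z[n - 1] = a * x[n - 1];
}
```

With independent counters, the two multiply-and-store chains can be scheduled into the same instruction bundle.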
:
:
VLIW
J.A. Fisher et al. “Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools” (J.A. Fisher is the creator of VLIW)
-
VLIW
Texas Instruments (User and Programmer guides, Application reports)
The Making of a Compiler for the Intel Itanium Processor
The Multiflow Trace Scheduling Compiler
DSP-C Specification and Embedded-C extensions
DSP
CEVA Launches Machine Learning DSP Solution: CEVA-XM6 (AnandTech)
CEVA-XC12 The world's most advanced communications DSP
DSP ( E. Belaish, CEVA)
Combining C code with assembly code in DSP applications
Architecture Oriented C Optimizations
Compiler optimization for DSP applications
, Senior Engineer, System-on-Chip SW Team, Samsung
Samsung: , .
Samsung Exynos DSP NPU. , , . Samsung state-of-the-art , . : -. .