PowerPoint presentation...modules, 3) makes minimal assumptions about the physical...
August 28, 2018
Clock and cores
Circuits are still getting larger at a steady pace...
… but speed-up is no longer the result of a faster clock. Instead, we get more parallel devices
Intel desktop processor top model
The end of scaling
Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Dotted line extrapolations by C. Moore.
C. Moore, Data Processing in ExaScale-Class Computer Systems, April 2011
Performance = parallelism. Efficiency = locality.
It really is that simple. (Bill Dally)
Approximate power in a 28 nm processor (Bill Dally)
Power efficiency in CMOS technology
Parallelism and locality
Computer architecture
Programming
    for (int i = 1; i <= N_a; i++) {
      for (int j = 1; j <= N_b; j++) {
        temp[0] = H[i-1][j-1] + sim_score(seq_a[i-1], seq_b[j-1]);
        temp[1] = H[i-1][j] - delta;
        temp[2] = H[i][j-1] - delta;
        temp[3] = 0.;
        H[i][j] = find_array_max(temp, 4);
        switch (ind) {
          case 0: I_i[i][j] = i-1; I_j[i][j] = j-1; break;
          case 1: I_i[i][j] = i-1; I_j[i][j] = j;   break;
          case 2: I_i[i][j] = i;   I_j[i][j] = j-1; break;
          case 3: I_i[i][j] = i;   I_j[i][j] = j;   break;
        }
      }
    }
???
New Trends for Processing Architectures
Programming Applications on Parallel Machines
Am2045
TILE64
GeForce 9400
PC102
SEAforth40C18
C w/ “iLib”
SOPM (“aJava”)
C + VHDL
VentureForth
CUDA
HDL
Virtex
C + VHDL + OpenCL
Kalray Bostan MPPA
C / C++ + OpenCL
Xilinx UltraScale+ MPSoC
Programming Applications on Sequential Machines
Pentium IV
PDP-11
Fujitsu SuperSPARC
68020
“Von Neumann” machine
C, Pascal, Forth, Java, Lisp, Fortran, Haskell, ...
Programming Concurrent Heterogeneous Machines?
??
Processor: 1
Multi-core: 1 to 16
Pentium IV
Intel® Core™ i9-7960X Processor
Kalray Bostan MPPA
Nvidia GeForce 9400
FPGA: 1000s ~ 10000s
Xilinx UltraScale+ MPSoC
Many-core: 1 to 288
[Figure: application (explicit parallelism) versus platform (physical parallelism), both axes running from 1 to 1000. Annotations: parallelizing compilers, parallelization, sequentialization.]
Portable Concurrency: the Need of a Paradigm Shift
Which Features do we need?
For software to scale to future parallel computing platforms as seamlessly as possible, it will be necessary to describe algorithms in a way that:
1) exposes as much parallelism of the application as practical,
2) provides simple and natural abstractions that help manage the high degree of parallelism and permits principled composition of and interaction between modules,
3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on,
4) is efficiently implementable on a wide range of computing substrates, including traditional sequential processors, shared-memory multicores, manycore processor arrays, and programmable logic devices, as well as their combinations.
Looking at the above criteria:
• Shared-memory threads fulfill the first requirement, but essentially fail on the other three.
• Hardware description languages such as VHDL and Verilog fulfill the first two criteria, but as they fail on the third point by assuming a particular model of (clocked) time, their implementability is essentially limited to hardware and hardware-like programmable logic (FPGAs).
Problems with sequential computation models
No explicit parallelism is available: lack of expressivity
A deterministic sequence of low-level operations is fully specified: there is no distinction between what is necessary for correct semantics and what is not (which makes memory bandwidth optimizations more difficult)
Memory bandwidth does not scale with the number of cores/processors
Why does threading not solve the problem?
Threads are sequential processes that share memory.
Pros: Languages require little or no syntactic changes to support threads, and operating systems and architectures have evolved to efficiently support them.
Cons: They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism.
Threads do not compose
Thread race on shared state
n = 0;
Thread 1: t1 = n; n = t1 + 1;
Thread 2: t2 = n; n = t2 + 1;
Each thread is determinate. Their composition is not. In the domain of threads, determinacy is not compositional.
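The race above can be made concrete by enumerating every interleaving of the two threads. The following is a minimal Python sketch, not part of the original slides; the names `run` and `all_schedules` are hypothetical. Each thread is modelled as two atomic steps (`t = n`, then `n = t + 1`), which is an assumption about the granularity of the interleaving.

```python
from itertools import combinations

def run(schedule):
    """Execute one interleaving; schedule is a sequence of thread ids (1 or 2)."""
    n = 0
    local = {1: None, 2: None}   # the private variables t1 and t2
    pc = {1: 0, 2: 0}            # index of the next step of each thread
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = n       # t = n
        else:
            n = local[tid] + 1   # n = t + 1
        pc[tid] += 1
    return n

def all_schedules():
    """All orderings of two steps of thread 1 and two steps of thread 2."""
    for pos in combinations(range(4), 2):
        yield tuple(1 if i in pos else 2 for i in range(4))

results = {s: run(s) for s in all_schedules()}
outcomes = sorted(set(results.values()))
print(outcomes)  # [1, 2]
```

Only the two schedules that run one thread to completion before the other yield `n = 2`; the four overlapping schedules lose an update and yield `n = 1`.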
Dataflow programs: not exactly a new model...
• digital signal processing
• media processing
• video coding
• image processing / analytics
• audio
• networking / packet processing
• ...
Data flow schemas (Dennis et al.)
• networks of (potentially concurrent) functions
• triggered by the availability of input data
• executing in a sequence of atomic transitions
• described by firing rules
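A firing rule can be sketched as a predicate over the input arcs. The Python fragment below is an illustration only, assuming the simplest rule (one token on every input arc); the `Node` class and `run` loop are hypothetical names, and the graph computes (a + b) * c.

```python
from collections import deque

class Node:
    def __init__(self, fn, inputs, output):
        self.fn, self.inputs, self.output = fn, inputs, output

    def can_fire(self):
        # Firing rule: one token is available on every input arc.
        return all(len(q) > 0 for q in self.inputs)

    def fire(self):
        # Atomic transition: consume one token per input, produce one token.
        args = [q.popleft() for q in self.inputs]
        self.output.append(self.fn(*args))

def run(nodes):
    """Keep firing enabled nodes until no node can fire."""
    fired = True
    while fired:
        fired = False
        for node in nodes:
            if node.can_fire():
                node.fire()
                fired = True

a, b, c, s, out = (deque() for _ in range(5))
nodes = [Node(lambda x, y: x * y, [s, c], out),   # listed first on purpose
         Node(lambda x, y: x + y, [a, b], s)]
a.extend([1, 2]); b.extend([10, 20]); c.extend([3, 4])
run(nodes)
print(list(out))  # [33, 88]
```

The multiplier is listed before the adder, yet the result is the same: execution order follows token availability, not program order.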
Kahn process networks (Kahn)
• networks of stream functions...
• described as communicating tasks with blocking reads
• precise criterion for determinacy (prefix-monotonicity)
• formal fixpoint semantics
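Blocking reads are what make such a network determinate. As a rough Python sketch (threads and `queue.Queue` stand in for processes and unbounded channels; the `END` marker and the process names are assumptions of this example, not part of Kahn's model):

```python
import threading
from queue import Queue

END = object()  # end-of-stream marker, an assumption of this sketch

def source(out, values):
    for v in values:
        out.put(v)
    out.put(END)

def scale(inp, out, k):
    while True:
        v = inp.get()            # blocking read: waits until a token arrives
        if v is END:
            out.put(END)
            return
        out.put(k * v)

def add(in_a, in_b, out):
    while True:
        a = in_a.get()           # a process cannot test a channel for emptiness,
        b = in_b.get()           # it can only commit to reading it
        if a is END or b is END:
            out.put(END)
            return
        out.put(a + b)

def run_network(xs, ys):
    c1, c2, c3, c4 = Queue(), Queue(), Queue(), Queue()
    procs = [
        threading.Thread(target=source, args=(c1, xs)),
        threading.Thread(target=source, args=(c2, ys)),
        threading.Thread(target=scale, args=(c1, c3, 10)),
        threading.Thread(target=add, args=(c3, c2, c4)),
    ]
    for p in procs:
        p.start()
    result = []
    while True:
        v = c4.get()
        if v is END:
            break
        result.append(v)
    for p in procs:
        p.join()
    return result

print(run_network([1, 2, 3], [4, 5, 6]))  # [14, 25, 36]
```

However the operating system schedules the four threads, the output stream is the same, in contrast with the shared-variable race shown earlier.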
Actors (Hewitt et al.)
• Stateful, active objects (actors)
• Communicating via message passing
• Non-determinism and partial orders
• Functional actors with fixpoint semantics
More Recent Dataflow Concepts
• computational kernels (aka “actors”)
• directed point-to-point connections
  • lossless, order-preserving
  • asynchronous
  • conceptually unbounded (but...)
• communicating discrete data packets (“tokens”)
Actors (& Actions)
Actions
State
• actors encapsulate state
• only interact via ports
• execution is a sequence of steps
• each step executes an “action”
• an action can:
  1) read tokens
  2) write tokens
  3) modify actor state
written in CAL (Cal Actor Language)
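The step-wise execution model can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: `Downsample` is a hypothetical actor invented for illustration (it forwards every second token), and ports are modelled as plain deques.

```python
from collections import deque

class Downsample:
    """Hypothetical actor: forwards every second input token."""

    def __init__(self):
        self.inp = deque()     # input port
        self.out = deque()     # output port
        self._keep = True      # encapsulated state, invisible to other actors

    def step(self):
        """Execute one action if possible; return True when a step was taken."""
        if not self.inp:
            return False
        token = self.inp.popleft()       # 1) read tokens
        if self._keep:
            self.out.append(token)       # 2) write tokens
        self._keep = not self._keep      # 3) modify actor state
        return True

actor = Downsample()
actor.inp.extend([5, 6, 7, 8, 9])
while actor.step():
    pass
print(list(actor.out))  # [5, 7, 9]
```

The rest of a network would only ever touch `inp` and `out`; the `_keep` flag changes solely as a side effect of the actor's own steps.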
CAL, a formal language for dataflow: origin and features
The CAL language is a dataflow-oriented language that was specified and developed as a subproject of the Ptolemy project at the University of California at Berkeley.
The initial CAL language specification was released in December 2003. An ISO standard version of CAL was released in October 2009.
CAL is a textual language that is used to define the functionality of dataflow components called “actors”.
CAL is implemented in at least three publicly available modeling and simulation environments: Ptolemy II, OpenDF and Orcc.
An actor is a modular component that encapsulates its own state: no other actor has access to it, and nothing other actors do can modify it.
The only interaction between actors is through FIFO channels connecting “output ports” to “input ports,” which they use to send and receive “tokens”.
This strong encapsulation leads to loosely coupled systems with very manageable and controllable actor interfaces.
The modularity of an actor assembly fosters concurrent development, facilitates maintainability and understandability, and makes systems constructed in this way more robust and easier to modify.
CAL - A crash course
    actor Sum () A ==> X:
      s := 0;
      action A: [a] ==> X: [s]
      do
        s := s + a;
      end
    end

    actor Add () A, B ==> X:
      action A: [a], B: [b] ==> X: [a + b] end
    end

    actor Merge () A, B ==> X:
      action A: [v] ==> X: [v] end
      action B: [v] ==> X: [v] end
    end

    actor Split () A ==> P, N:
      action A: [v] ==> P: [v] guard v >= 0 end
      action A: [v] ==> N: [-v] guard v < 0 end
    end

    actor BiasedMerge () A, B ==> X:
      CopyA: action A: [v] ==> X: [v] end
      CopyB: action B: [v] ==> X: [v] end
      priority CopyA > CopyB; end
    end

1. consumption & production of tokens
2. state
3. multiple actions
4. guards
5. priorities
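To see what state, guards and priorities mean operationally, three of the actors above can be imitated in Python. This is a loose sketch, not a CAL implementation: the function names are hypothetical, ports are lists or deques, and `sum_actor` assumes that the output expression `[s]` is evaluated after the action body has updated `s`.

```python
from collections import deque

def sum_actor(tokens):
    """Sum: one state variable, one action per input token."""
    s, out = 0, []
    for a in tokens:
        s = s + a          # do s := s + a
        out.append(s)      # X: [s], assumed to see the updated state
    return out

def split(tokens):
    """Split: two actions on the same port, selected by their guards."""
    p, n = [], []
    for v in tokens:
        if v >= 0:         # guard v >= 0
            p.append(v)
        else:              # guard v < 0
            n.append(-v)
    return p, n

def biased_merge_step(a, b, x):
    """One step of BiasedMerge; returns the name of the fired action or None."""
    if a:                  # CopyA is enabled and outranks CopyB
        x.append(a.popleft())
        return "CopyA"
    if b:                  # CopyB fires only when CopyA cannot
        x.append(b.popleft())
        return "CopyB"
    return None

a, b, x = deque([1, 2]), deque([10]), deque()
fired = [biased_merge_step(a, b, x) for _ in range(4)]
print(fired, list(x))
```

Without the `priority` clause, Merge may pick either action when both inputs hold a token, so its output order depends on the scheduler; the priority removes that choice whenever both actions are enabled.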
Actors: big and small
    2474, 0, 2482, 3034, 2490, 3026, 2498, 3018,
    2506, 2978, 2514, 2890, 2522, 2770, 2530, 2714,
    2538, 2658, 2546, 2634, 2554, 2610, 2562, 2586,
    2570, 2578, 1, 1, 1, 1, 2594, 2602,
    1, 1, 1, 1, 2618, 2626, 128, -128,
    124, -124, 2642, 2650, 120, -120, 116, -116,
    2666, 2690, 2674, 2682, 112, -112, 108, -108,
    2698, 2706, 104, -104, 100, -100, 2722, 2746,
    2730, 2738, 96, -96, 92, -92, 2754, 2762,
    88, -88, 84, -84, 2778, 2834, 2786, 2810,
    2794, 2802, 80, -80, 76, -76, 2818, 2826,
    72, -72, 68, -68, 2842, 2866, 2850, 2858,
    64, -64, 60, -60, 2874, 2882, 56, -56,
    52, -52, 2898, 2970, 2906, 2946, 2914, 2938,
    2922, 2930, 48, -48, 44, -44, 40, -40,
    2954, 2962, 36, -36, 32, -32, 28, -28,
    2986, 3010, 2994, 3002, 24, -24, 20, -20,
    16, -16, 12, -12, 8, -8, 4, -4];

// Starting indices into VLD table above
int MCBPC_IVOP_START_INDEX = 0;
int MCBPC_PVOP_START_INDEX = 16;
int CBPY_START_INDEX = 58;
int DCBITS_Y_START_INDEX = 92;
int DCBITS_UV_START_INDEX = 118;
int COEFF_INTER_START_INDEX = 144;
int COEFF_INTRA_START_INDEX = 380;
int MV_START_INDEX = 616;
// VLD decode engine.
int( size=VLD_TABLE_ADDR_BITS ) vld_index;
int( size=VLD_TABLE_DATA_BITS ) vld_codeword := 1;

procedure start_vld_engine( int index )
begin
  vld_index := index;
  vld_codeword := 2;
end

function vld_success() --> bool : bitand(vld_codeword,3) = 0 end
function vld_failure() --> bool : bitand(vld_codeword,1) = 1 end
function vld_result() --> int( size=VLD_TABLE_DATA_BITS ) : rshift(vld_codeword,2) end

action bits:[ b ] ==>
guard
  bitand(vld_codeword,3) = 2
do
  vld_codeword := vld_table[ vld_index + if b then 1 else 0 end ];
  vld_index := rshift(vld_codeword,2);
  bit_count := bit_count + 1;
end

vld_failure: action ==>
guard
  vld_failure()
// do
//   println("Bad VLD codeword");
end
generic_done: action ==>
guard
  done_reading_bits()
end

generic_done_with_bitread: action bits:[b] ==>
guard
  done_reading_bits()
do
  bit_count := bit_count + 1;
end

// stuck: action ==>
// do
//   set_bits_to_read(0);
//   reset_texture();
//   println("Stuck at bit_count " + bit_count);
// end

test_zero_byte: action ==>
guard
  done_reading_bits(),
  bitand( read_result(), 255 ) = 0
end

test_one_byte: action ==>
guard
  done_reading_bits(),
  bitand( read_result(), 255 ) = 1
end

request_byte: action ==>
guard
  done_reading_bits()
do
  set_bits_to_read( 8 );
end

request_vol: action ==>
do
  set_bits_to_read( VOL_START_CODE_LENGTH );
end
schedule fsm stuck_1a /* look_for_vo */ :

// Process a new VOL
look_for_vo ( look_for_vo ) --> vo_header;
vo_header ( generic_done ) --> stuck;
vo_header ( vo_header.good ) --> look_for_vol;

// Process a new VOL
look_for_vol ( generic_done ) --> stuck;
look_for_vol ( vol_startcode.good ) --> vol_object;
vol_object ( vol_object_layer_identification ) --> vol_aspect;
vol_aspect ( vol_aspect.detailed ) --> vol_control;
vol_aspect ( generic_done ) --> vol_control;
vol_control ( vol_control.detailed ) --> vol_vbv;
vol_control ( generic_done_with_bitread ) --> vol_shape;
vol_vbv ( vol_vbv.detailed ) --> vol_shape;
vol_vbv ( generic_done_with_bitread ) --> vol_shape;
vol_shape ( vol_shape ) --> vol_time_inc_res;
vol_time_inc_res ( vol_time_inc_res ) --> vol_width;
vol_width ( vol_width ) --> vol_height;
vol_height ( vol_height ) --> vol_misc;
// vol_misc ( vol_misc.unsupported ) --> stuck;
vol_misc ( generic_done ) --> look_for_vop;

// Process a new VOP
look_for_vop ( byte_align ) --> get_vop_code;
get_vop_code ( get_vop_code ) --> vop_code;
vop_code ( vo_header.good ) --> look_for_vol; // DBP: permit concatenation of sequences without proper end code
vop_code ( vop_code.done ) --> look_for_vo; // DBP: start from the beginning
vop_code ( vop_code.start ) --> vop_predict;
vop_code ( generic_done ) --> stuck;
vop_predict ( vop_predict /* .supported */ ) --> vop_timebase;
// vop_predict ( generic_done ) --> stuck;
vop_timebase ( vop_timebase.one ) --> vop_timebase;
vop_timebase ( vop_timebase.zero ) --> vop_coding;
vop_coding ( vop_coding.uncoded ) --> look_for_vop;
vop_coding ( vop_coding.coded ) --> send_new_vop_info;
send_new_vop_info ( send_new_vop_cmd ) --> send_new_vop_width;
send_new_vop_width ( send_new_vop_width ) --> send_new_vop_height;
send_new_vop_height( send_new_vop_height ) --> mb;

// Start MB
mb ( mb_done ) --> look_for_vop;
mb ( mcbpc_pvop_uncoded ) --> pvop_uncoded1;
pvop_uncoded1 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded2;
pvop_uncoded2 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded3;
pvop_uncoded3 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded4;
pvop_uncoded4 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded5;
pvop_uncoded5 ( mcbpc_pvop_uncoded1 ) --> mb;
mb ( get_mcbpc ) --> get_mbtype;
get_mbtype ( vld_failure ) --> stuck;
get_mbtype ( get_mbtype ) --> final_cbpy;
final_cbpy ( vld_failure ) --> stuck;
final_cbpy ( final_cbpy_intra ) --> block;
final_cbpy ( final_cbpy_inter ) --> mv;

// process all blocks in an MB
block ( mb_dispatch_done ) --> mb;
block ( mb_dispatch_intra ) --> texture;
block ( mb_dispatch_inter_ac_coded ) --> texture;
block ( mb_dispatch_inter_no_ac ) --> block;

// Start texture
texture ( vld_start_intra ) --> get_dc_bits;
texture ( vld_start_inter ) --> texac;
get_dc_bits ( get_dc_bits.none ) --> texac;
get_dc_bits ( get_dc_bits.some ) --> get_dc;
get_dc_bits ( vld_failure ) --> stuck;
get_dc ( dc_bits_shift ) --> get_dc_a;
get_dc_a ( get_dc ) --> texac;
texac ( block_done ) --> block;
texac ( dct_coeff ) --> vld1;
vld1 ( vld_code ) --> texac;
vld1 ( vld_level ) --> vld4;
vld1 ( vld_run_or_direct ) --> vld7;
vld7 ( vld_run ) --> vld6;
vld7 ( vld_direct_read ) --> vld_direct;
vld1 ( vld_failure ) --> stuck;
vld_direct ( vld_direct ) --> texac;
vld4 ( do_level_lookup ) --> vld4a;
vld4 ( vld_failure ) --> stuck;
vld4a ( vld_level_lookup ) --> texac;
vld6 ( do_run_lookup ) --> vld6a;
vld6 ( vld_failure ) --> stuck;
vld6a ( vld_run_lookup ) --> texac;

// mv()
mv ( mvcode_done ) --> block;
mv ( mvcode ) --> mag_x;
mag_x ( vld_failure ) --> stuck;
mag_x ( mag_x ) --> get_residual_x;
get_residual_x ( get_residual_x ) --> mv_y;
mv_y ( mvcode ) --> mag_y;
mag_y ( vld_failure ) --> stuck;
mag_y ( mag_y ) --> get_residual_y;
get_residual_y ( get_residual_y ) --> mv;
// stuck( stuck ) --> stuck_for_good;
// DBP: add minimal error resilience.
// byte align, then look for a VO header starting on any byte boundary.
// Can't handle bit insertion or deletion, obviously. The VO header
// is hex 00000100.
stuck ( byte_align ) --> stuck_1a;

stuck_1a ( request_byte ) --> stuck_1b;
stuck_1b ( test_zero_byte ) --> stuck_2a;
stuck_1b ( generic_done ) --> stuck_1a;

stuck_2a ( request_byte ) --> stuck_2b;
stuck_2b ( test_zero_byte ) --> stuck_3a;
stuck_2b ( generic_done ) --> stuck_1a;

stuck_3a ( request_byte ) --> stuck_3b;
stuck_3b ( test_zero_byte ) --> stuck_3a;
stuck_3b ( test_one_byte ) --> stuck_4a;
stuck_3b ( generic_done ) --> stuck_1a;

stuck_4a ( request_byte ) --> stuck_4b;
stuck_4b ( test_zero_byte ) --> request_vol; // Found a good VO header (Assumes VO ID field=0)
stuck_4b ( generic_done ) --> stuck_1a;

request_vol ( request_vol ) --> look_for_vol;
end
priority
  vo_header.good > generic_done;
  vol_aspect.detailed > generic_done;
  vol_control.detailed > generic_done_with_bitread;
  vol_vbv.detailed > generic_done_with_bitread;
  // vol_misc.unsupported > generic_done;

  vop_code.done > generic_done;
  vop_code.start > generic_done;
  // vop_predict.supported > generic_done;
  vop_timebase.one > vop_timebase.zero;
  vop_coding.uncoded > vop_coding.coded;

  mb_done > get_mcbpc;
  mb_done > mcbpc_pvop_uncoded;
  get_mcbpc.pvop > mcbpc_pvop_uncoded;

  get_mbtype.noac > get_mbtype.ac;
  final_cbpy_inter > final_cbpy_intra;
  mb_dispatch_done > mb_dispatch_intra > mb_dispatch_inter_no_ac > mb_dispatch_inter_ac_coded;

  vld_start_intra > vld_start_inter;
  get_dc_bits.none > get_dc_bits.some;
  block_done > dct_coeff;
  vld_code > vld_level > vld_run_or_direct;
  vld_run > vld_direct_read;

  vld_start_inter.ac_coded > vld_start_inter.not_ac_coded;

  mvcode_done > mvcode;

  test_zero_byte > generic_done;
  test_one_byte > generic_done;
end
end
/* BEGINCOPYRIGHT X

Copyright (c) 2004-2006, Xilinx Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ENDCOPYRIGHT
*/
// ParseHeaders.cal
//
// Author: David B. Parlour ([email protected])
//
import all caltrop.lib.BitOps;
// BTYPE Output
// A single stream of control tokens is sent on the BTYPE output to distribute
// VOP and block type information. A minimum of 12 bits is required for this token,
// which also sends the VOP width and height parameters. The relevant parameters are:

// MB_COORD_SZ - size of any variable that contains a macroblock coordinate
// BTYPE_SZ - size of the BTYPE output (min. 12)

// Bits to signal token type
// NEWVOP 2048
// INTRA 1024
// INTER 512

// If the NEWVOP bits is set, the following field can be masked off:
// QUANT_MASK 31
// ROUND_TYPE 32
// FCODE_MASK 448
// FCODE_SHIFT 6

// The next two tokens after NEWVOP are the width and height in macroblocks

// The INTER type also has
// ACPRED 1
// ACCODED 2
// FOURMV 4
// MOTION 8

// The INTRA type also has ACPRED
actor ParseHeaders (
  // Maximum image width (in units of macroblocks) that the decoder can handle.
  // It is used to allocate line buffer space at compile time.
  int MAXW_IN_MB,
  int MB_COORD_SZ,
  int BTYPE_SZ,
  int MV_SZ,
  int NEWVOP,
  int INTRA,
  int INTER,
  int QUANT_MASK,
  int ROUND_TYPE,
  int FCODE_MASK,
  int FCODE_SHIFT,
  int ACPRED,
  int ACCODED,
  int FOURMV,
  int MOTION,
  int SAMPLE_COUNT_SZ,
  int SAMPLE_SZ
) bool bits ==> int(size=BTYPE_SZ) BTYPE, int(size=MV_SZ) MV, int(size=SAMPLE_COUNT_SZ) RUN, int(size=SAMPLE_SZ) VALUE, bool LAST :
// Constants for various field lengths (bits) and special values.
int VO_HEADER_LENGTH = 27;
int VO_NO_SHORT_HEADER = 8;
int VO_ID_LENGTH = 5;

int VOL_START_CODE_LENGTH = 28;
int VOL_START_CODE = 18;
int VOL_ID_LENGTH = 5;
int VIDEO_OBJECT_TYPE_INDICATION_LENGTH = 8;
int VISUAL_OBJECT_LAYER_VERID_LENGTH = 4;
int VISUAL_OBJECT_LAYER_PRIORITY_LENGTH = 3;
int ASPECT_RATIO_INFO_LENGTH = 4;
int ASPECT_RATIO_INFO_IS_DETAILED = 15;
int PAR_WIDTH_LENGTH = 8;
int PAR_HEIGHT_LENGTH = 8;
int CHROMA_FORMAT_LENGTH = 2;
int LOW_DELAY_LENGTH = 1;
int FIRST_HALF_BIT_RATE_LENGTH = 15;
int LAST_HALF_BIT_RATE_LENGTH = 15;
int FIRST_HALF_VBV_BUF_SZ_LENGTH = 15;
int LAST_HALF_VBV_BUF_SZ_LENGTH = 3;
int FIRST_HALF_VBV_OCC_LENGTH = 11;
int LAST_HALF_VBV_OCC_LENGTH = 15;
int VOL_SHAPE_LENGTH = 2;
int MARKER_LENGTH = 1;
int TIME_INC_RES_LENGTH = 16;
int VOL_WIDTH_LENGTH = 13;
int VOL_HEIGHT_LENGTH = 13;

//int TIME_INC_RES_MASK = lshift( 1, TIME_INC_RES_LENGTH ) - 1;
// Note: mask off width, height in units of MBs, not pixels
//int VOL_WIDTH_MASK = lshift( 1, VOL_WIDTH_LENGTH - 4 ) - 1;
//int VOL_HEIGHT_MASK = lshift( 1, VOL_HEIGHT_LENGTH - 4 ) - 1;
//int ASPECT_RATIO_INFO_MASK = lshift( 1, ASPECT_RATIO_INFO_LENGTH ) - 1;

int RUN_LENGTH = 6;
int RUN_MASK = lshift( 1, RUN_LENGTH ) - 1;
int LEVEL_LENGTH = 12;
int LEVEL_MASK = lshift( 1, LEVEL_LENGTH ) - 1;

// count of misc VOL bits:
// interlaced OBMC_disabled vol_sprite_usage not_8_bit
// vol_quant_type complexity_est_disable resync_marker_disable data_partitioning_enable
// scalability
// Note: We assume data_partitioning_enable is always false, so the model
// does not do a conditional read of reversible_vlc_enable
int MISC_BIT_LENGTH = 9;

int VOP_START_CODE_LENGTH = 32;
int VOP_START_CODE = 438;
int VOP_PREDICTION_LENGTH = 2;
int B_VOP = 2;
int P_VOP = 1;
int I_VOP = 0;
int INTRA_DC_VLC_THR_LENGTH = 3;
int VOP_FCODE_FOR_LENGTH = 3;
int VOP_FCODE_FOR_MASK = lshift( 1, VOP_FCODE_FOR_LENGTH ) - 1;
int BITS_QUANT = 5;
int BITS_QUANT_MASK = lshift( 1, BITS_QUANT ) - 1;
int MCBPC_LENGTH = 9;
int ESCAPE_CODE = 7167;

function mask_bits( v, n ) --> int :
  bitand( v, lshift(1,n)-1 )
end
// Utility action to read a specified number of bits.
// This is an unnamed action, ie it is always enabled and has highest priority.
// Use the procedures set_bits_to_read() to start reading, test for
// completion with the boolean done_reading_bits() and get the value
// with read_result(). Use the done function in a guard to wait for the
// reading to be complete.
int(size=7) bits_to_read_count := -1;
int(size=33) read_result_in_progress;

procedure set_bits_to_read( int count )
begin
  bits_to_read_count := count - 1;
  read_result_in_progress := 0;
end

function done_reading_bits() --> bool : bits_to_read_count < 0 end
function read_result() --> int : read_result_in_progress end

action bits:[ b ] ==>
guard
  not done_reading_bits()
do
  read_result_in_progress := bitor( lshift( read_result_in_progress, 1), if b then 1 else 0 end );
  bits_to_read_count := bits_to_read_count - 1;
  bit_count := bit_count + 1;
end
int(size=4) bit_count := 0;
/********************************************************************************************************************************** start VOL **********************************************************************************************************************************/
// DBP: modified to support only VO_NO_SHORT_HEADER
look_for_vo: action ==>
do
  set_bits_to_read( VO_HEADER_LENGTH + VO_ID_LENGTH );
end

// We can only handle VOL without short header
vo_header.good: action ==>
guard
  done_reading_bits(),
  mask_bits( rshift( read_result(), VO_ID_LENGTH ), VO_HEADER_LENGTH ) = VO_NO_SHORT_HEADER
do
  set_bits_to_read( VOL_START_CODE_LENGTH );
end

vol_startcode.good: action ==>
guard
  done_reading_bits(),
  mask_bits( read_result(), VOL_START_CODE_LENGTH ) = VOL_START_CODE
do
  // Ignore the next two fields
  set_bits_to_read( VOL_ID_LENGTH + VIDEO_OBJECT_TYPE_INDICATION_LENGTH );
end
vol_object_layer_identification: action bits:[b] ==>
guard
  done_reading_bits()
do
  set_bits_to_read(
    if b then
      // is_object_layer_identifier asserted
      VISUAL_OBJECT_LAYER_VERID_LENGTH + VISUAL_OBJECT_LAYER_PRIORITY_LENGTH + ASPECT_RATIO_INFO_LENGTH
    else
      ASPECT_RATIO_INFO_LENGTH
    end );
  bit_count := bit_count + 1;
end

vol_aspect.detailed: action ==>
guard
  done_reading_bits(),
  mask_bits( read_result(), ASPECT_RATIO_INFO_LENGTH ) = ASPECT_RATIO_INFO_IS_DETAILED
do
  // Skip over aspect ratio details
  set_bits_to_read( PAR_WIDTH_LENGTH + PAR_HEIGHT_LENGTH );
end

vol_control.detailed: action bits:[b] ==>
guard
  done_reading_bits(),
  b
do
  set_bits_to_read( CHROMA_FORMAT_LENGTH + LOW_DELAY_LENGTH );
  bit_count := bit_count + 1;
end

vol_vbv.detailed: action bits:[b] ==>
guard
  done_reading_bits(),
  b
do
  set_bits_to_read( FIRST_HALF_BIT_RATE_LENGTH + MARKER_LENGTH +
                    LAST_HALF_BIT_RATE_LENGTH + MARKER_LENGTH +
                    FIRST_HALF_VBV_BUF_SZ_LENGTH + MARKER_LENGTH +
                    LAST_HALF_VBV_BUF_SZ_LENGTH +
                    FIRST_HALF_VBV_OCC_LENGTH + MARKER_LENGTH +
                    LAST_HALF_VBV_OCC_LENGTH + MARKER_LENGTH );
  bit_count := bit_count + 1;
end

vol_shape: action ==>
guard
  done_reading_bits()
do
  set_bits_to_read( VOL_SHAPE_LENGTH + MARKER_LENGTH + TIME_INC_RES_LENGTH + MARKER_LENGTH + 1 );
end
int(size=7) mylog;
vol_time_inc_res: action ==>
guard
  done_reading_bits()
var
  int(size=TIME_INC_RES_LENGTH+1) time_inc_res := mask_bits( rshift( read_result(), 2 ), TIME_INC_RES_LENGTH ), // pick out the time_inc_res field
  int(size=7) count := 0,
  int(size=7) ones := 0
do
  while ( count = 0 or time_inc_res != 0 ) do
    if bitand( time_inc_res, 1 ) = 1 then
      ones := ones + 1;
    end
    count := count + 1;
    time_inc_res := rshift( time_inc_res, 1 );
  end
  mylog := if ones > 1 then count else count - 1 end;
  mylog := if mylog < 1 then 1 else mylog end;
  set_bits_to_read(
    if bitand( read_result(), 1 ) = 1 then
      // fixed vop rate
      //mylog + MARKER_LENGTH
      mylog
    else
      // variable vop rate
      //MARKER_LENGTH
      0
    end + MARKER_LENGTH + VOL_WIDTH_LENGTH + MARKER_LENGTH );
end
// The vol width and height in units of macroblocks, ie. the pixel w/h divided by 16.
// Note: there is no provision in the model for pixel sizes that are no multiples of 16.
int(size=MB_COORD_SZ) vol_width;
int(size=MB_COORD_SZ) vol_height;

vol_width: action ==>
guard
  done_reading_bits()
do
  vol_width := mask_bits( rshift( read_result(), MARKER_LENGTH + 4 ), VOL_WIDTH_LENGTH-4 ); // strip marker and divide by 16
  set_bits_to_read( VOL_HEIGHT_LENGTH + MARKER_LENGTH );
end

vol_height: action ==>
guard
  done_reading_bits()
do
  vol_height := mask_bits( rshift( read_result(), MARKER_LENGTH + 4 ), VOL_HEIGHT_LENGTH ); // strip marker and divide by 16
  set_bits_to_read( MISC_BIT_LENGTH );
end

// Detect unsupported features:
// interlaced, sprites, not 8-bit pixels, not using method 2 quantization, data partitioning, scalability
// vol_misc.unsupported: action ==>
// guard
//   done_reading_bits(),
//   test mask is 9-bit binary 101110011
//   bitand( read_result(), 371 ) != 0
// end
/******************** start VOP ********************/

byte_align: action ==>
do
  set_bits_to_read( 8 - bitand( bit_count, 7 ) );
end

// Note: no check for correct bit stuffing

get_vop_code: action ==>
guard
  done_reading_bits()
do
  set_bits_to_read( VOP_START_CODE_LENGTH );
end

// Note: the model does not support user data here
vop_code.done: action ==>
guard
  done_reading_bits(),
  read_result() = 1
do
  bit_count := 0;
end
int(size=MB_COORD_SZ) mbx;
int(size=MB_COORD_SZ) mby;

vop_code.start: action ==>
guard
  done_reading_bits(),
  read_result() = VOP_START_CODE
do
  mbx := 0;
  mby := 0;
  set_bits_to_read( VOP_PREDICTION_LENGTH );
end

// int(size=2) prediction_type;

// vop_predict.supported: action ==>
// guard
//   done_reading_bits(),
//   mask_bits( read_result(), VOP_PREDICTION_LENGTH ) = I_VOP or
//   mask_bits( read_result(), VOP_PREDICTION_LENGTH ) = P_VOP
// do
//   prediction_type := read_result();
// end

bool prediction_is_IVOP;

vop_predict: action ==>
guard
  done_reading_bits()
do
  prediction_is_IVOP := mask_bits( read_result(), VOP_PREDICTION_LENGTH ) = I_VOP;
end

// Note: the model does not support time_base. It just skips over the right number of bits.
vop_timebase.one: action bits:[b] ==>
guard
  b
do
  // Skip over module-time_base
  bit_count := bit_count + 1;
end

vop_timebase.zero: action bits:[b] ==>
do
  bit_count := bit_count + 1;
  set_bits_to_read( MARKER_LENGTH + mylog + MARKER_LENGTH );
end
int(size=4) comp;

// TODO: the model does not communicate to the display driver
// to re-use the current VOP in place of the uncoded one.
vop_coding.uncoded: action bits:[b] ==>
guard
  done_reading_bits(),
  not b
do
  comp := 0;
  bit_count := bit_count + 1;
end

vop_coding.coded: action bits:[b] ==>
guard
  done_reading_bits()
do
  set_bits_to_read(
    if not prediction_is_IVOP then
      // round_type, intra_dc_vlc_thr[3], vop_quant[5], vop_fcode_for[3]
      1 + INTRA_DC_VLC_THR_LENGTH + BITS_QUANT + VOP_FCODE_FOR_LENGTH
    else
      // intra_dc_vlc_thr[3], vop_quant[5]
      INTRA_DC_VLC_THR_LENGTH + BITS_QUANT
    end );
  bit_count := bit_count + 1;
end

// int(size=10) resync_marker_length;
int(size=VOP_FCODE_FOR_LENGTH+1) fcode;

send_new_vop_cmd: action ==> BTYPE:[ cmd ]
guard
  done_reading_bits()
var
  bool round := false,
  int(size=BTYPE_SZ) cmd := bitor( NEWVOP, if prediction_is_IVOP then INTRA else INTER end ),
  int(size=BITS_QUANT+1) vop_quant
do
  if not prediction_is_IVOP then
    round := bitand( rshift( read_result(), INTRA_DC_VLC_THR_LENGTH + BITS_QUANT + VOP_FCODE_FOR_LENGTH ), 1 ) = 1;
    vop_quant := bitand( rshift( read_result(), VOP_FCODE_FOR_LENGTH ), BITS_QUANT_MASK );
    fcode := bitand( read_result(), VOP_FCODE_FOR_MASK );
    // resync_marker_length := 16 + fcode;
  else
    vop_quant := bitand( read_result(), BITS_QUANT_MASK );
    fcode := 0;
    // resync_marker_length := 17;
  end
  cmd := bitor( cmd, bitand(vop_quant,QUANT_MASK) );
  cmd := bitor( cmd, if round then ROUND_TYPE else 0 end );
  cmd := bitor( cmd, bitand( lshift( fcode, FCODE_SHIFT), FCODE_MASK) );
end

send_new_vop_width: action ==> BTYPE: [ vol_width ] end

send_new_vop_height: action ==> BTYPE: [ vol_height ] end
/******************** start MB ********************/

int CBP_SZ = 7;

int(size=CBP_SZ) cbp;

// Advance the mb coordinates with wrap
procedure next_mbxy()
begin
  mbx := mbx + 1;
  if mbx = vol_width then
    mbx := 0;
    mby := mby + 1;
  end
end

// Go look for next VOP
mb_done: action ==>
guard
  mby = vol_height
end;

get_mcbpc.ivop: action ==>
guard
  prediction_is_IVOP
do
  start_vld_engine( MCBPC_IVOP_START_INDEX );
end

get_mcbpc.pvop: action bits:[b] ==>
guard
  not prediction_is_IVOP,
  not b
do
  start_vld_engine( MCBPC_PVOP_START_INDEX );
  bit_count := bit_count + 1;
end

// Nothing to do - uncoded
mcbpc_pvop_uncoded: action bits:[b] ==> BTYPE:[ INTER ]
guard
  not prediction_is_IVOP
do
  next_mbxy();
  bit_count := bit_count + 1;
end

mcbpc_pvop_uncoded1: action ==> BTYPE:[ INTER ]
end
bool acpredflag;
bool btype_is_INTRA;
int(size=CBP_SZ) cbpc;
bool fourmvflag;

get_mbtype.noac: action ==>
guard
  vld_success(),
  type != 3,
  type != 4
var
  int mcbpc = vld_result(),
  int type = bitand( mcbpc, 7 )
do
  btype_is_INTRA := type >= 3;
  fourmvflag := (type = 2);
  cbpc := bitand( rshift( mcbpc, 4 ), 3 );
  acpredflag := false;
  start_vld_engine( CBPY_START_INDEX );
end

get_mbtype.ac: action bits:[b] ==>
guard
  vld_success()
var
  int mcbpc = vld_result(),
  int type = bitand( mcbpc, 7 )
do
  btype_is_INTRA := true;
  cbpc := bitand( rshift( mcbpc, 4 ), 3 );
  acpredflag := b;
  bit_count := bit_count + 1;
  start_vld_engine( CBPY_START_INDEX );
end

bool ac_coded;

int(size=4) mvcomp;

final_cbpy_inter: action ==>
guard
  vld_success(),
  not btype_is_INTRA
var
  int cbpy = 15 - vld_result()
do
  comp := 0;
  mvcomp := 0;
  cbp := bitor( lshift( cbpy, 2), cbpc );
end

final_cbpy_intra: action ==>
guard
  vld_success()
var
  int cbpy = vld_result()
do
  comp := 0;
  cbp := bitor( lshift(cbpy, 2), cbpc );
end
mb_dispatch_done: action ==>
guard
  comp = 6
do
  next_mbxy();
end

mb_dispatch_intra: action ==> BTYPE:[ cmd ]
guard
  btype_is_INTRA
var
  int(size=BTYPE_SZ) cmd := INTRA
do
  ac_coded := bitand( cbp, lshift( 1, 5 - comp)) != 0;
  cmd := bitor( cmd, if ac_coded then ACCODED else 0 end );
  cmd := bitor( cmd, if acpredflag then ACPRED else 0 end );
end

mb_dispatch_inter_no_ac: action ==> BTYPE:[ bitor( INTER, bitor( MOTION, if fourmvflag then FOURMV else 0 end ) ) ]
guard
  bitand( cbp, lshift( 1, 5 - comp)) = 0
do
  ac_coded := false;
  comp := comp + 1;
end

mb_dispatch_inter_ac_coded: action ==> BTYPE:[ bitor(bitor( INTER, ACCODED ), bitor( MOTION, if fourmvflag then FOURMV else 0 end ) ) ]
do
  ac_coded := true;
end

vld_start_intra: action ==>
guard
  btype_is_INTRA
do
  start_vld_engine( if comp < 4 then DCBITS_Y_START_INDEX else DCBITS_UV_START_INDEX end );
  b_last := false;
end

// There are AC coefficients
vld_start_inter.ac_coded: action ==>
guard
  ac_coded
do
  b_last := false;
end

// No AC coefficients
vld_start_inter.not_ac_coded: action ==> RUN:[0], VALUE:[0], LAST:[true]
do
  b_last := true;
end
// The Y DC value is at most 12 bits, UV at most 13 bits
int(size=5) dc_bits;

get_dc_bits.none: action ==> RUN:[0], VALUE:[0], LAST:[ not ac_coded ]
guard
  vld_success(),
  vld_result() = 0
do
  b_last := not ac_coded;
end

get_dc_bits.some: action ==>
guard
  vld_success()
do
  dc_bits := vld_result();
  set_bits_to_read( dc_bits );
end

int(size=14) msb_result;

dc_bits_shift: action ==>
var
  int(size=5) count = dc_bits,
  int(size=14) shift = 1
do
  while count > 1 do
    shift := lshift( shift, 1 );
    count := count - 1;
  end
  msb_result := shift;
end

get_dc: action ==> RUN:[0], VALUE:[v], LAST:[ not ac_coded ]
guard
  done_reading_bits()
var
  int(size=14) v = read_result()
do
  if bitand( v, msb_result ) = 0 then
    v := v + 1 - lshift( msb_result, 1 );
  end
  set_bits_to_read( if dc_bits > 8 then MARKER_LENGTH else 0 end );
  b_last := not ac_coded;
end

bool b_last;

block_done: action ==>
guard
  done_reading_bits(),
  b_last
do
  comp := comp + 1;
end
dct_coeff: action ==>
guard
  done_reading_bits()
do
  start_vld_engine( if btype_is_INTRA then COEFF_INTRA_START_INDEX else COEFF_INTER_START_INDEX end );
end

vld_code: action bits:[sign] ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]
guard
  vld_success(),
  val != ESCAPE_CODE
var
  int(size=VLD_TABLE_DATA_BITS) val = vld_result(),
  int(size=SAMPLE_COUNT_SZ) run,
  int(size=SAMPLE_SZ) level,
  bool last
do
  run := if btype_is_INTRA then bitand( rshift( val, 8), 255)
         else bitand( rshift( val, 4), 255) end;
  last := if btype_is_INTRA then bitand( rshift( val, 16), 1) != 0
          else bitand( rshift( val, 12), 1) != 0 end;
  level := if btype_is_INTRA then bitand( val, 255)
           else bitand( val, 15) end;
  b_last := last;
  bit_count := bit_count + 1;
end

vld_level: action bits:[level_offset] ==>
guard
  vld_success(),
  not level_offset
do
  bit_count := bit_count + 1;
  start_vld_engine( if btype_is_INTRA then COEFF_INTRA_START_INDEX else COEFF_INTER_START_INDEX end );
end

vld_run_or_direct: action bits:[level_offset] ==>
guard
  vld_success()
do
  bit_count := bit_count + 1;
end

vld_run: action bits:[run_offset] ==>
guard
  not run_offset
do
  bit_count := bit_count + 1;
  start_vld_engine( if btype_is_INTRA then COEFF_INTRA_START_INDEX else COEFF_INTER_START_INDEX end );
end

vld_direct_read: action bits:[run_offset] ==>
do
  bit_count := bit_count + 1;
  set_bits_to_read( 1 + RUN_LENGTH + MARKER_LENGTH + LEVEL_LENGTH + MARKER_LENGTH );
end

vld_direct: action ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]
guard
  done_reading_bits()
var
  int(size=SAMPLE_COUNT_SZ) run,
  int(size=SAMPLE_SZ) level,
  bool sign,
  bool last
do
  last := bitand( rshift( read_result(), RUN_LENGTH + MARKER_LENGTH + LEVEL_LENGTH + MARKER_LENGTH ), 1 ) != 0;
  run := bitand( rshift( read_result(), MARKER_LENGTH + LEVEL_LENGTH + MARKER_LENGTH ), RUN_MASK );
  level := bitand( rshift( read_result(), MARKER_LENGTH ), LEVEL_MASK );
  if level >= 2048 then
    sign := true;
    level := 4096 - level;
  else
    sign := false;
  end
  b_last := last;
end
function intra_max_level( bool last, int run) :
  if not last then
    if run = 0 then 27 else
    if run = 1 then 10 else
    if run = 2 then 5 else
    if run = 3 then 4 else
    if run < 8 then 3 else
    if run < 10 then 2 else
    if run < 15 then 1 else 0
    end end end end end end end
  else
    if run = 0 then 8 else
    if run = 1 then 3 else
    if run < 7 then 2 else
    if run < 21 then 1 else 0
    end end end end
  end
end

function inter_max_level( bool last, int run) :
  if not last then
    if run = 0 then 12 else
    if run = 1 then 6 else
    if run = 2 then 4 else
    if run < 7 then 3 else
    if run < 11 then 2 else
    if run < 27 then 1 else 0
    end end end end end end
  else
    if run = 0 then 3 else
    if run = 1 then 2 else
    if run < 41 then 1 else 0
    end end end
  end
end
int(size=SAMPLE_SZ) level_lookup_inter;
int(size=SAMPLE_SZ) level_lookup_intra;

do_level_lookup: action ==>
guard
  vld_success()
var
  int(size=VLD_TABLE_DATA_BITS-2) val = vld_result()
do
  level_lookup_inter := inter_max_level( bitand( rshift( val, 12), 1) != 0, bitand( rshift( val, 4), 255));
  level_lookup_intra := intra_max_level( bitand( rshift( val, 16), 1) != 0, bitand( rshift( val, 8), 255));
end

vld_level_lookup: action bits:[sign] ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]
var
  int(size=VLD_TABLE_DATA_BITS-2) val = vld_result(),
  int(size=SAMPLE_COUNT_SZ) run,
  int(size=SAMPLE_SZ) level,
  bool last
do
  run := if btype_is_INTRA then bitand( rshift( val, 8), 255)
         else bitand( rshift( val, 4), 255) end;
  last := if btype_is_INTRA then bitand( rshift( val, 16), 1) != 0
          else bitand( rshift( val, 12), 1) != 0 end;
  level := if btype_is_INTRA then bitand( val, 255) + level_lookup_intra
           else bitand( val, 15) + level_lookup_inter end;
  b_last := last;
  bit_count := bit_count + 1;
end
function intra_max_run( bool last, int level) :
  if not last then
    if level = 1 then 14 else
    if level = 2 then 9 else
    if level = 3 then 7 else
    if level = 4 then 3 else
    if level = 5 then 2 else
    if level < 11 then 1 else 0
    end end end end end end
  else
    if level = 1 then 20 else
    if level = 2 then 6 else
    if level = 3 then 1 else 0
    end end end
  end
end

function inter_max_run( bool last, int level) :
  if not last then
    if level = 1 then 26 else
    if level = 2 then 10 else
    if level = 3 then 6 else
    if level = 4 then 2 else
    if level = 5 or level = 6 then 1 else 0
    end end end end end
  else
    if level = 1 then 40 else
    if level = 2 then 1 else 0
    end end
  end
end
int(size=SAMPLE_SZ) run_lookup_inter;
int(size=SAMPLE_SZ) run_lookup_intra;

// Do lookup both ways
do_run_lookup: action ==>
guard
  vld_success()
var
  int(size=VLD_TABLE_DATA_BITS-2) val = vld_result()
do
  run_lookup_inter := inter_max_run( bitand( rshift( val, 12), 1) != 0, bitand( val, 15) );
  run_lookup_intra := intra_max_run( bitand( rshift( val, 16), 1) != 0, bitand( val, 255) );
end

vld_run_lookup: action bits:[sign] ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]
var
  int(size=VLD_TABLE_DATA_BITS-2) val = vld_result(),
  int(size=SAMPLE_COUNT_SZ) run,
  int(size=SAMPLE_SZ) level,
  bool last
do
  last := if btype_is_INTRA then bitand( rshift( val, 16), 1) != 0
          else bitand( rshift( val, 12), 1) != 0 end;
  level := if btype_is_INTRA then bitand( val, 255)
           else bitand( val, 15) end;
  run := if btype_is_INTRA then bitand( rshift( val, 8), 255) + run_lookup_intra
         else bitand( rshift( val, 4), 255) + run_lookup_inter end + 1;
  b_last := last;
  bit_count := bit_count + 1;
end
/******************** Motion Decode ********************/

mvcode_done: action ==>
guard
  mvcomp = 4 or (mvcomp = 1 and not fourmvflag )
end

mvcode: action ==>
do
  start_vld_engine( MV_START_INDEX );
end

mag_x: action ==> MV:[ mvval ]
guard
  vld_success()
var
  int(size=VLD_TABLE_DATA_BITS) mvval = vld_result()
do
  set_bits_to_read( if fcode <= 1 or mvval = 0 then 0 else fcode end );
end

get_residual_x: action ==> MV:[ read_result() ]
guard
  done_reading_bits()
end

mag_y: action ==> MV:[ mvval ]
guard
  vld_success()
var
  int(size=VLD_TABLE_DATA_BITS) mvval = vld_result()
do
  set_bits_to_read( if fcode <= 1 or mvval = 0 then 0 else fcode-1 end );
end

get_residual_y: action ==> MV:[ read_result() ]
guard
  done_reading_bits()
do
  mvcomp := mvcomp + 1;
end
int VLD_TABLE_ADDR_BITS = 12;
int VLD_TABLE_DATA_BITS = 20;

list( type:int( size=VLD_TABLE_DATA_BITS ), size=760 ) vld_table = [

// Automatically-generated tables for bitwise MPEG-4 VLD
//
// Decoding proceeds as follows:
// 1. Set index to a starting value for the desired code (see embedded comments).
// 2. Read the next bit in the incoming stream.
// 3. Fetch the value at table[ index + bit ]
// 4. Take the following action based on table value:
//    4.1. If lsb = 1, terminate decoding with illegal codeword error.
//    4.2. If 2nd lsb = 1, the codeword is not complete,
//         so set index to value >> 2, go to step 2.
//    4.3. For all other values stop decoding and return value >> 2

// start index for MCBPC_IVOP is 0
// (cumulative table size is 16 words x 20 bits)
10, 12, 18, 58, 26, 76, 34, 16,42, 50, 1, 80, 144, 208, 140, 204,
// start index for MCBPC_PVOP is 16
// (cumulative table size is 58 words x 20 bits)
74, 0, 82, 226, 90, 218, 98, 202,106, 178, 114, 162, 122, 146, 130, 138,1, 1, 208, 144, 154, 140, 80, 196,170, 204, 76, 200, 186, 194, 136, 72,132, 68, 210, 12, 16, 192, 128, 64,8, 4,
// start index for CBPY is 58
// (cumulative table size is 92 words x 20 bits)
242, 338, 250, 314, 258, 298, 266, 290,274, 282, 1, 1, 24, 36, 32, 16,306, 0, 8, 4, 322, 330, 48, 40,56, 20, 346, 60, 354, 362, 52, 12,44, 28,
// start index for DCBITS_Y is 92
// (cumulative table size is 118 words x 20 bits)
378, 466, 386, 458, 394, 16, 402, 20,410, 24, 418, 28, 426, 32, 434, 36,442, 40, 450, 44, 1, 48, 12, 0,8, 4,
// start index for DCBITS_UV is 118
// (cumulative table size is 144 words x 20 bits)
482, 570, 490, 8, 498, 12, 506, 16,514, 20, 522, 24, 530, 28, 538, 32,546, 36, 554, 40, 562, 44, 1, 48,4, 0,
// start index for COEFF_INTER is 144
// (cumulative table size is 380 words x 20 bits)
586, 1498, 594, 1426, 602, 1338, 610, 1194,618, 1066, 626, 874, 634, 818, 642, 794,650, 770, 658, 714, 666, 690, 674, 682,1, 1, 1, 1, 698, 706, 1, 1,1, 1, 722, 746, 730, 738, 1, 1,1, 1, 754, 762, 1, 1, 1, 1,778, 786, 16456, 16396, 44, 40, 802, 810,18180, 18116, 18052, 17988, 826, 850, 834, 842,584, 520, 456, 392, 858, 866, 328, 204,140, 80, 882, 28668, 890, 946, 898, 922,906, 914, 48, 84, 1476, 1540, 930, 938,18244, 18308, 18372, 18436, 954, 1010, 962, 986,970, 978, 88, 144, 268, 332, 994, 1002,396, 648, 1604, 1668, 1018, 1042, 1026, 1034,18500, 18564, 18628, 18692, 1050, 1058, 18756, 18820,18884, 18948, 1074, 1138, 1082, 1114, 1090, 1106,1098, 17924, 36, 32, 17860, 17796, 1122, 1130,17732, 17668, 17604, 17540, 1146, 1170, 1154, 1162,17476, 16392, 1412, 1348, 1178, 1186, 1284, 1220,1156, 1092, 1202, 1282, 1210, 1258, 1218, 1242,1226, 1234, 1028, 964, 264, 200, 1250, 17412,28, 24, 1266, 1274, 17348, 17284, 17220, 17156,1290, 1314, 1298, 1306, 17092, 17028, 16964, 900,1322, 1330, 836, 136, 76, 20, 1346, 1402,1354, 1378, 1362, 1370, 16900, 16836, 16772, 16708,1386, 1394, 772, 708, 644, 16, 1410, 1418,16644, 16580, 16516, 16452, 1434, 1482, 1442, 1466,1450, 1458, 580, 516, 452, 388, 1474, 324,72, 12, 1490, 16388, 260, 196, 4, 1506,68, 1514, 132, 8,
// start index for COEFF_INTRA is 380
// (cumulative table size is 616 words x 20 bits)
1530, 2442, 1538, 2370, 1546, 2282, 1554, 2138,1562, 2010, 1570, 1818, 1578, 1762, 1586, 1738,1594, 1714, 1602, 1658, 1610, 1634, 1618, 1626,1, 1, 1, 1, 1642, 1650, 1, 1,1, 1, 1666, 1690, 1674, 1682, 1, 1,1, 1, 1698, 1706, 1, 1, 1, 1,1722, 1730, 262172, 262168, 88, 84, 1746, 1754,264200, 263180, 262164, 13316, 1770, 1794, 1778, 1786,5132, 8200, 4108, 3088, 1802, 1810, 2064, 1052,80, 76, 1826, 28668, 1834, 1890, 1842, 1866,1850, 1858, 92, 96, 1056, 9224, 1874, 1882,265224, 266248, 277508, 278532, 1898, 1954, 1906, 1930,1914, 1922, 100, 104, 108, 1060, 1938, 1946,6156, 1064, 2068, 7180, 1962, 1986, 1970, 1978,14340, 262176, 267272, 268296, 1994, 2002, 279556, 280580,281604, 282628, 2018, 2082, 2026, 2058, 2034, 2050,2042, 276484, 72, 68, 275460, 274436, 2066, 2074,273412, 272388, 263176, 262160, 2090, 2114, 2098, 2106,12292, 11268, 7176, 6152, 2122, 2130, 5128, 3084,2060, 1048, 2146, 2226, 2154, 2202, 2162, 2186,2170, 2178, 1044, 64, 4104, 60, 2194, 270340,56, 52, 2210, 2218, 269316, 268292, 262156, 10244,2234, 2258, 2242, 2250, 9220, 8196, 271364, 3080,2266, 2274, 1040, 48, 44, 40, 2290, 2346,2298, 2322, 2306, 2314, 266244, 265220, 6148, 267268,2330, 2338, 7172, 2056, 1036, 36, 2354, 2362,262152, 5124, 264196, 263172, 2378, 2426, 2386, 2410,2394, 2402, 4100, 3076, 32, 28, 2418, 2052,1032, 24, 2434, 262148, 20, 16, 4, 2450,8, 2458, 1028, 12,
// start index for MV is 616
// (cumulative table size is 760 words x 20 bits)
/* BEGINCOPYRIGHT X

Copyright (c) 2007, Xilinx Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
ENDCOPYRIGHT*/
import all caltrop.lib.BitOps;

actor ForemanSource () int(size=8) MPEG, int(size=8) COMPARE ==> int(size=8) dataout, int(size=32) errors :

  int mismatch := 0;
  int(size=24) count := 0;

  action MPEG:[mpeg], COMPARE:[comp] ==> dataout:[ mpeg ]
  guard
    mismatch = 0
  do
    mismatch := bitxor(mpeg, comp);
    count := count + 1;
  end

  action ==> errors:[ bitor(lshift(count, 8), bitand(old mismatch, 0xFF)) ]
  guard
    mismatch != 0
  do
    mismatch := 0;
  end

end
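The behaviour of the comparison actor listed above can be sketched in Python. This is only an illustrative model, not part of the original sources: the function name `compare_streams` and the list-based streams are assumptions, and the two guarded actions are folded into one loop.

```python
def compare_streams(mpeg, reference):
    """Model of the comparison actor: forward every byte, report mismatches.

    Hypothetical helper, not taken from the OpenDF sources.
    """
    dataout, errors = [], []
    count = 0
    for m, c in zip(mpeg, reference):
        # First action: fires while no mismatch is pending.
        dataout.append(m)
        mismatch = m ^ c                      # bitxor(mpeg, comp)
        count = (count + 1) & 0xFFFFFF        # count is declared as 24 bits
        if mismatch != 0:
            # Second action: emit the byte position and the differing bits.
            errors.append((count << 8) | (mismatch & 0xFF))
    return dataout, errors
```

Each error word carries the running byte count in its upper bits and the XOR of the two differing bytes in its lowest 8 bits.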
Compare: 23 lines (without header comments)
ParseHeaders: 1320 lines (without header comments)
http://opendf.svn.sourceforge.net/viewvc/opendf/models/MPEG4_SP_Decoder/
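The bitwise table walk described in the `vld_table` comments of the ParseHeaders listing can be tried out with a small Python sketch. The 16 table words are the MCBPC_IVOP segment copied from the listing; the function name `vld_decode` and the list-of-bits input are assumptions made for illustration.

```python
# MCBPC_IVOP segment of vld_table (start index 0), copied from the listing.
MCBPC_IVOP = [10, 12, 18, 58, 26, 76, 34, 16,
              42, 50, 1, 80, 144, 208, 140, 204]

def vld_decode(table, start_index, bits):
    """Walk the table one bit at a time; return (value, bits_consumed)."""
    index = start_index
    for consumed, bit in enumerate(bits, start=1):
        value = table[index + bit]
        if value & 1:                 # lsb set: illegal codeword
            raise ValueError("illegal codeword")
        if value & 2:                 # 2nd lsb set: codeword not complete
            index = value >> 2
        else:                         # complete: the payload is value >> 2
            return value >> 2, consumed
    raise ValueError("ran out of bits before the codeword completed")
```

With this segment, the single bit `1` decodes to 3, and the returned value can then be split the same way `get_mbtype` does it (`type = value & 7`, `cbpc = (value >> 4) & 3`).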
![Page 23: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/23.jpg)
Actor/Dataflow Programming Model
Actors: encapsulated state, guarded atomic actions, dataflow I/O interfaces, autonomous schedule
Inside an actor: actions, schedule, state
Connections: point-to-point, buffered token-passing
class MyActor { schedule(); readPort( portNum ); writePort( portNum ); }
Actor source + network: simulation, software, hardware (via high-level synthesis)
![Page 24: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/24.jpg)
Static scheduling: token rates
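actor Add () Input1, Input2 ==> Output:
  action [a], [b] ==> [a + b] end
end
actor AddSeq () Input ==> Output:
  action [a, b] ==> [a + b] end
end
Token rates per firing: Add consumes 1 token on each input and produces 1; AddSeq consumes 2 tokens on its single input and produces 1.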
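The fixed token rates of these two actors can be illustrated with a small Python sketch. The function names and the use of `deque` as a FIFO channel are assumptions for illustration only.

```python
from collections import deque

def fire_add(in1, in2, out):
    """Add: consumes one token from each input, produces one token."""
    if in1 and in2:
        out.append(in1.popleft() + in2.popleft())
        return True
    return False  # not enough tokens: the action cannot fire

def fire_add_seq(inp, out):
    """AddSeq: consumes two tokens from one input, produces one token."""
    if len(inp) >= 2:
        a = inp.popleft()
        b = inp.popleft()
        out.append(a + b)
        return True
    return False
```

In both cases the number of tokens consumed and produced per firing is known without looking at the token values, which is what makes these actors statically schedulable.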
![Page 25: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/25.jpg)
Static scheduling: token rates
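actor Split () Input ==> Pos, Neg:
  action [a] ==> Pos: [a]
  guard a >= 0 end
  action [a] ==> Neg: [a]
  guard a < 0 end
end
actor Multiplex () S, A, B ==> Output:
  action S: [sel], A: [v] ==> [v]
  guard sel end
  action S: [sel], B: [v] ==> [v]
  guard not sel end
end
Token rates per firing: Split consumes 1 token, but the rates on Pos and Neg are data-dependent (?); Multiplex consumes 1 token on S and produces 1, but the rates on A and B are data-dependent (?).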
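A Python sketch of the same two actors shows why their rates cannot be fixed at compile time. As before, the function names and the `deque` channels are illustrative assumptions.

```python
from collections import deque

def fire_split(inp, pos, neg):
    """Split: one token in, routed to Pos or Neg depending on its sign."""
    if not inp:
        return False
    a = inp.popleft()
    (pos if a >= 0 else neg).append(a)
    return True

def fire_multiplex(s, a, b, out):
    """Multiplex: the selector token decides which data input is consumed."""
    if not s:
        return False
    src = a if s[0] else b      # peek at the selector without consuming it
    if not src:
        return False            # the selected input has no token yet
    s.popleft()
    out.append(src.popleft())
    return True
```

How many tokens end up on `pos` versus `neg`, or are taken from `a` versus `b`, depends on the values flowing through the network.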
![Page 26: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/26.jpg)
Dataflow programs: MoCs
• SDF: static, fixed token exchange rates, compile-time analysis
• CSDF: extension of SDF with the notion of state, the same compile-time properties
• Dynamic extensions: different aspects of dynamic behavior, run-time changes of the topology or token exchange rates
• DDF: firing rules and firing functions can be data-dependent, run-time analysis only
[Figure: MoC classes DDF, dynamic extensions to SDF (IDF, BDF, PSDF, SADF, SPDF, BPDF, TPDF), CSDF, SDF]
![Page 27: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/27.jpg)
Dataflow programs: MoCs (continued)
[Figure: the same MoC overview as on the previous slide, with ATS added]
![Page 28: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/28.jpg)
Static scheduling conundrum
| Synchronous Dataflow (SDF) | Dataflow |
|---|---|
| static token rates | arbitrary token rates, may change over time, be data dependent |
| static schedule, static buffer analysis, guaranteed deadlock-free | no static schedule, no hard buffer-bounds at compile time, deadlock undecidable |
| efficient implementation due to compile-time analysis, low or zero runtime overhead | runtime scheduling, requires runtime decision making |
![Page 29: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/29.jpg)
Static scheduling
| Synchronous Dataflow (SDF) | Dataflow w/ CAL | Dataflow |
|---|---|---|
| static token rates | arbitrary token rates, may change over time, be data dependent | arbitrary token rates, may change over time, be data dependent |
| static schedule, static buffer analysis, guaranteed deadlock-free | partial static scheduling: statically schedulable when possible | no static schedule, no hard buffer-bounds at compile time, deadlock undecidable |
| efficient implementation due to compile-time analysis, low or zero runtime overhead | adaptive implementation: use compile-time information, make dynamic choices at runtime | runtime scheduling, requires runtime decision making |
![Page 30: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/30.jpg)
Static dataflow scheduling
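[Figure: network of actors A, B and C with token rates (1 or 2) on each edge, and the resulting static schedule ACABC]
Statically schedulable networks are desirable:
• efficient software implementations
• no deadlocks, yet small buffers
• reduced synchronization overhead (in hardware)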
Real-world dataflow networks are seldom completely statically schedulable.
Use internal actor structure to identify analyzable regions.
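How a static schedule is sized can be sketched with the SDF balance equations: for every edge, firings of the producer times tokens produced must equal firings of the consumer times tokens consumed. The Python sketch below is an illustration under assumptions: the function name `repetition_vector` is hypothetical, and the edge rates in `EDGES` are not read from the figure but chosen so that the result matches the firing counts of the schedule ACABC (A twice, C twice, B once).

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """edges: (producer, tokens_produced, consumer, tokens_consumed).

    Returns the smallest positive integer firing counts that balance
    every edge, or None if the rates are inconsistent.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:                       # propagate rates along the edges
        changed = False
        for src, p, dst, c in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * p / c
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * c / p
                changed = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    for src, p, dst, c in edges:         # check every balance equation
        if rate[src] * p != rate[dst] * c:
            return None
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}

# Assumed rates, chosen to be consistent with the schedule ACABC.
EDGES = [("A", 1, "C", 1), ("A", 1, "B", 2), ("C", 1, "B", 2)]
```

For these assumed rates the repetition vector is A: 2, B: 1, C: 2. Any valid static schedule fires each actor that many times per iteration, and ACABC is one such ordering.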
![Page 31: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/31.jpg)
State-of-the-art: (dataflow) parallel implementations
[Figure: dataflow MoCs placed on high-to-low scales of expressiveness, facility of developing efficient implementations, and analyzability. Models shown, top to bottom: DDF; TPDF; BPDF, SPDF, SADF, PSDF; IDF, BDF; SDF, CSDF]
![Page 32: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/32.jpg)
![Page 33: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/33.jpg)
![Page 34: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/34.jpg)
Sequential versus Dataflow
![Page 35: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/35.jpg)
CAL MPEG-4 SP DECODER
| Design Unit | MPEG-4 SP CAL lines of code | MPEG-4 Part 7 (Optimized reference SW) lines of C code | MPEG-4 SP (Optimized Proprietary SW) lines of C code |
|---|---|---|---|
| Parser (2 actors) | 1585 | | |
| AC/DC Reconstruction (3 actors) | 468 | | |
| 2D-IDCT (2 actors) | 52 | | |
| Motion Compensation (4 actors) | 631 | | |
| TOTALS | 2736 | 31452 (.src + .h files) | 6115 with display control and other tools (.src + .h files) |
![Page 36: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/36.jpg)
Dataflow programming framework
Open RVC-CAL Compiler (Orcc)
• Flexible compilation infrastructure
• Source-to-source: hardware and/or software code
• Multi-target: GPP (x86, ARM, i7, ….), FPGA, …
[Figure: actor.cal → Front-end → .ir (intermediate representation) → Back-ends → source code + build script → target compiler / synthesizer]
![Page 37: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/37.jpg)
CAL Actor Language: Tool infrastructure
Frontend
• Intermediate representation simpler than the direct AST
Backends
• Transformations: classification, SSA transformation, 3AC, repeat transformation, …
• Interpreter, profiling analysis (EPFL)
• Quasi-static analysis (Oulu)
• Code generation, actor instantiation: C / C++; Verilog (Xilinx); C – Vivado HLS; TTA (network + actor in pure LLVM code generation) (IRISA); Promela backend (Abo)
• Code generation, actor as class: LLVM (LLVM + metadata)
![Page 38: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/38.jpg)
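Code generation with Orcc
[Figure: Orcc compiles a network of actors (A1, A2, B; decoder.xdf) for three targets. C: A1.c, A2.c, B.c and decoder.c are built with Gcc into Decoder.exe. LLVM: A.bc and B.bc run on JADE with the LLVM JIT compiler. Java: A.class, B.class and decoder.java run on the Java Virtual Machine with the Hotspot Java JIT compiler.]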
![Page 39: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/39.jpg)
CAL HLS Tool Flow based on Xronos
[Figure: Xronos HLS tool flow. ORCC: parser, AST, IR, ORCC IR transformations. XRONOS: ORCC IR HDL transformations, ORCC IR to LIM IR. OpenForge: LIM IR optimizations and scheduling, Verilog generator, testbench generator, report generator.]
Xronos offers full support for the RVC-CAL programming language
CAL to HDL
• Xronos analyzes the source and determines:
  • Inputs & outputs
  • Operations
  • Data dependencies
• A control/dataflow graph (CDFG) is constructed:
  • With no timing information
  • Represents the data flow for the functionality of the design
• The graph shows data dependencies:
  • Operations that must complete before others can begin
  • e.g. (a*b) must be complete before (a*b)+c
• The graph shows control dependencies:
  • If/else constructs turn into mux operators
  • Loop constructs turn into a set of @always blocks
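The data-dependency part of such a graph can be illustrated in Python for the expression (a*b)+c. This is a sketch of the general idea only, not of the Xronos implementation: the dictionary layout, the node names and the function `asap_levels` are assumptions.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose results it needs.
# "a", "b" and "c" are inputs; "mul" is a*b; "add" is (a*b)+c.
CDFG = {"mul": {"a", "b"}, "add": {"mul", "c"}}

def asap_levels(deps):
    """Earliest step at which each node can be evaluated (inputs are 0)."""
    levels = {}
    for node in TopologicalSorter(deps).static_order():
        preds = deps.get(node, ())
        # A node is ready one step after its latest predecessor.
        levels[node] = 1 + max((levels[p] for p in preds), default=-1)
    return levels
```

The multiplication lands on level 1 and the addition on level 2, which reflects that (a*b) must complete before (a*b)+c can begin. The graph itself still carries no timing information; the levels are only an ordering.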
![Page 40: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/40.jpg)
Xronos HLS – OpenForge Optimizations
Xronos HLS Optimizations
Memory Optimizations:
• Dual port memory conversion, one clock for read or write
• Memory ROM reduction, reduces the allocated size of each element
• Memory merging, reduces size by grouping memories
• Memory resizing, resizes the length of a memory to a power of 2
• Read only field reducing, ROM to constant
• ROM replicating, reduces the clock access by replicating the ROM
Control Flow Optimization:
• Loop unrolling, unrolls loops
• Pipelining, inserting registers to break the combinatorial path into multiple stages
![Page 41: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/41.jpg)
41
Xronos HLS – Actors I/Os
Xronos HLS Actor IO
actor ACTOR() int IN ==> int OUT:…
end
Each input and output of an actor is transformed into a double handshake interface
Actor
IN_DATA (31:0)
IN_SEND
IN_ACK
IN_RDY
OUT_DATA (31:0)
OUT_SEND
OUT_ACK
OUT_RDY
CLK
RESET
![Page 42: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/42.jpg)
42
Xronos HLS – Dataflow Application
[Figure: actor A drives a fanout F_A, which feeds queues Q_B and Q_C placed in front of actors B and C. Every port carries DATA, SEND, ACK and RDY signals; on the output side of F_A the SEND, ACK and RDY signals are 2 bits wide (1:0).]
Xronos HLS Network of Actors
• Each input port is connected to a queue.
• If an output port feeds more than one input port, a connection with the appropriate fanout is inserted.
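The two rules above can be sketched in software terms: one queue per input port, and a fanout that copies each token into every connected queue. The class name `Fanout` and the use of `deque` are assumptions for illustration; the generated hardware uses handshake signals, not Python objects.

```python
from collections import deque


class Fanout:
    """Copies every token sent on one output port to all connected input queues."""

    def __init__(self, n_inputs):
        # One queue per connected input port (Q_B and Q_C in the figure).
        self.queues = [deque() for _ in range(n_inputs)]

    def send(self, token):
        for queue in self.queues:
            queue.append(token)


fanout = Fanout(2)
fanout.send(7)
fanout.send(8)
print([list(q) for q in fanout.queues])
```

Each consumer then reads from its own queue independently of the other.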
![Page 43: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/43.jpg)
43
Xronos HLS – Inside an Actor
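[Figure: modules of an actor. Action modules a0 and a1 each have GO and DONE signals, R_Counter and W_Counter accesses to the state variable Counter, an input (a0_IN, a1_IN) fed by IN_DATA, and an output (a0_OUT, a1_OUT) driving OUT_DATA. The Scheduler receives IN_SEND, OUT_RDY, a0_DONE and a1_DONE, and drives a0_GO and a1_GO. Other signals shown: a0_Value, a1_Value, a0_IN_ACK, a1_IN_ACK, a0_OUT_SEND, a1_OUT_SEND, a1_OUT_DATA, IN_ACK, OUT_SEND.]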
Xronos HLS Modules of an Actor
actor CondCounter() int IN ==> int OUT:
  int Counter := 0;
  a0: action IN:[token] ==> OUT:[Counter]
  guard token >= 0
  do
    Counter := Counter + 1;
  end
  a1: action IN:[token] ==> OUT:[Counter]
  guard token < 0
  do
    Counter := Counter + 2;
  end
  …
  schedule fsm s0:
    s0 (a0) --> s1;
    s1 (a1) --> s2;
    …
  end
end
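A software model of the guard-based action selection in CondCounter may help when reading the hardware modules. This is a sketch under two assumptions: the `schedule fsm` is ignored because it is elided on the slide, and the output expression `[Counter]` is taken to be evaluated after the action body has run. The Python class and method names are illustrative only.

```python
class CondCounter:
    """Behavioural model of the CondCounter actor (FSM schedule not modelled)."""

    def __init__(self):
        self.counter = 0

    def fire(self, token):
        # Guards are mutually exclusive, so exactly one action is selected.
        if token >= 0:
            action = "a0"
            self.counter += 1
        else:
            action = "a1"
            self.counter += 2
        # The produced token is the counter value after the update (assumption).
        return action, self.counter


actor = CondCounter()
results = [actor.fire(token) for token in [5, -1, 0]]
print(results)
```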
![Page 44: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/44.jpg)
44
Xronos HLS – Inside the Actor scheduler
[Figure: actor scheduler. Guard a0 (token >= 0) and Guard a1 (token < 0) each take the Token and Output Available as inputs and produce a Result (a0_Guard, a1_Guard). The Action Scheduler uses the guard results, a0_DONE, a1_DONE and the Scheduler State (R_State, W_State) to drive a0_GO and a1_GO. External signals shown: IN_DATA, IN_SEND, OUT_RDY.]
Xronos HLS Actor Scheduler
![Page 45: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/45.jpg)
45
Xronos HLS – Clock Domains
[Figure: actor A connected to actor B through queue Q_B. Every port carries DATA, SEND, ACK and RDY signals. Clock signals shown: CLK, CLKA, CLKB, CLKF1, CLKF2.]
Xronos HLS Clock Domains
Multi-Clock Domain Support
• The user may assign (graphically or through a file) a different clock domain to each actor
• A multi-clock queue is inserted
• Power reduction with multiple clock domains and clock gating techniques
![Page 46: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/46.jpg)
46
Xronos HLS – Testbench Generation
HLS Testbenches
Automatic testbench generation:
• Only the input and output tokens, generated by the CAL simulator or by synthesized C code, are needed
• Error detection: the simulation stops and shows at which token the error occurred
• VHDL testbench for each actor and network
Input files generated for “Modelsim”:
• TCL file for each actor and network that:
  - Adds all inputs and outputs of the actor or network (for actor simulation or network simulation, respectively)
  - Adds the “Go” and “Done” signals for each action
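The error-detection idea (stop at the first token that differs from the reference) can be sketched as follows. The generated testbenches are VHDL; this Python function, its name `first_mismatch`, and the token values are hypothetical and only show the comparison logic.

```python
def first_mismatch(actual_tokens, golden_tokens):
    """Return the index of the first differing token, or None if the streams match."""
    for index, (actual, golden) in enumerate(zip(actual_tokens, golden_tokens)):
        if actual != golden:
            return index
    if len(actual_tokens) != len(golden_tokens):
        # One stream ended early: the error is at the first missing token.
        return min(len(actual_tokens), len(golden_tokens))
    return None


print(first_mismatch([1, 9, 3], [1, 2, 3]))
```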
![Page 47: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/47.jpg)
47
Xronos HLS – Testbench Generation
Xronos HLS Testbenches
Network I/O
Actor I/O
Go & Done of Actions
• Comparison with Gold Reference
• Error detection
![Page 48: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/48.jpg)
CAL DATAFLOW THE DESIGN FLOW
CAL DATAFLOW PROGRAMMING(ORCC)
DESIGN SPACE EXPLORATION(TURNUS)
HDL SYNTHESIS (XRONOS)
SW SYNTHESIS (ORCC backends)
[Figure: design flow diagram. Boxes: Constraints; CAL library; CAL Program; Functional Verification; Design Space Exploration; Interface Synthesis; Architecture Model; Mapping (Partitioning and Scheduling); CAL Profiling; Performance Analysis Data and Refactoring Directions; Refactored CAL Program; Hardware Code Generation; Software Code Generation; HDL Library; RTL Synthesis; Compilation; RTL Implementation; SW Executable; Runtime Performance Data; Refactoring Decisions; Iterations.]
![Page 49: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/49.jpg)
The structure of a computation
actor F () A ==> X:
  action A: [a] ==> X: [f(a)] end
end

actor Src () ==> X:
  s := 0;
  action ==> X: [s] do
    s := h(s);
  end
end

actor Snk () A ==> :
  action A: [a] ==> end
end

[Figure: the network Src → F → Snk, shown three times]

actor F () A ==> X:
  x := 0;
  action A: [a] ==> X: [f(x)] do
    x := g(a, x);
  end
end
![Page 50: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/50.jpg)
50
8x8 2D IDCT
Profiling: traces of a decoder
parsing, then IDCT
tail end of reconstruction, followed by IDCT
![Page 51: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/51.jpg)
51
actor CopyToken () X ==> Z:
  counter := 0;
  copy: action X: [x] ==> Z: [x]
  do
    counter := counter + 1;
  end
end

actor Generator () ==> O:
  counter := 0;
  gen: action ==> O: [counter]
  guard counter < 3
  do
    counter := counter + 1;
  end
end

actor Consumer () X, Y ==> :
  counter := 0;
  cons: action X: [x], Y: [y] ==>
  do
    counter := counter + 1;
  end
end

Network instances: A = Generator; B, C, D, E, F = CopyToken; G = Consumer
Execution Trace Graph generation
![Page 52: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/52.jpg)
52
Example 1)
• Partitioning: all actors in one processing component
• Scheduling: pre-defined static scheduling
• Buffer dimensioning: infinite buffer sizes
![Page 53: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/53.jpg)
53
Example 1)
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A (Generator), B to F (CopyToken) and G (Consumer)]
![Page 54: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/54.jpg)
54
Example 1)
Firing Step: A:gen:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 55: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/55.jpg)
55
Example 1)
Firing Step: B:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 56: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/56.jpg)
56
Example 1)
Firing Step: C:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 57: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/57.jpg)
57
Example 1)
Firing Step: D:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 58: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/58.jpg)
58
Example 1)
Firing Step: E:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 59: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/59.jpg)
59
Example 1)
Firing Step: F:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 60: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/60.jpg)
60
Example 1)
Firing Step: G:cons:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 61: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/61.jpg)
61
Example 1)
At the end of the execution
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 62: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/62.jpg)
62
Example 2)
• Partitioning: each actor in a separate processing component
• Scheduling: pre-defined static scheduling
• Buffer dimensioning: infinite buffer sizes
![Page 63: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/63.jpg)
63
Example 2)
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A (Generator), B to F (CopyToken) and G (Consumer)]
![Page 64: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/64.jpg)
64
Example 2)
Firing Step: A:gen:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 65: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/65.jpg)
65
Example 2)
Firing Step: A:gen:2, B:copy:1, D:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 66: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/66.jpg)
66
Example 2)
Firing Step: A:gen:3, B:copy:2, C:copy:1, D:copy:2, E:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 67: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/67.jpg)
67
Example 2)
Firing Step: B:copy:3, C:copy:2, D:copy:3, E:copy:2, F:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 68: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/68.jpg)
68
Example 2)
Firing Step: C:copy:3, E:copy:3, F:copy:2, G:cons:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 69: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/69.jpg)
69
Example 2)
Firing Step: F:copy:3, G:cons:2
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 70: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/70.jpg)
70
Example 2)
Firing Step: G:cons:3
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 71: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/71.jpg)
71
Example 2)
At the end of the execution
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 72: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/72.jpg)
72
Example 3)
• Partitioning: each actor in a separate processing component
• Scheduling: pre-defined static scheduling
• Buffer dimensioning: minimal buffer sizes (1)
![Page 73: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/73.jpg)
73
Example 3)
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A (Generator), B to F (CopyToken) and G (Consumer)]
![Page 74: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/74.jpg)
74
Example 3)
Firing Step: A:gen:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 75: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/75.jpg)
75
Example 3)
Firing Step: B:copy:1, D:copy:1
FIFO FULL!
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 76: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/76.jpg)
76
Example 3)
Firing Step: B:copy:1, D:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 77: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/77.jpg)
77
Example 3)
Firing Step: A:gen:2, C:copy:1, E:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 78: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/78.jpg)
78
Example 3)
FIFO FULL!
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 79: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/79.jpg)
79
Example 3)
Firing Step: B:copy:2, D:copy:2, F:copy:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 80: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/80.jpg)
80
Example 3)
Firing Step: B:copy:2, D:copy:2, F:copy:1
FIFO FULL!
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 81: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/81.jpg)
81
Example 3)
Firing Step: A:gen:3, E:copy:2, G:cons:1
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 82: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/82.jpg)
82
Example 3)
Firing Step: A:gen:3, E:copy:2, G:cons:1
FIFO FULL!
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 83: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/83.jpg)
83
Example 3)
Firing Step: C:copy:2, D:copy:3, F:copy:2
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 84: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/84.jpg)
84
Example 3)
Firing Step: B:copy:3, E:copy:3, G:cons:2
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 85: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/85.jpg)
85
Example 3)
Firing Step: C:copy:3, F:copy:3
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 86: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/86.jpg)
86
Example 3)
Firing Step: G:cons:3
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 87: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/87.jpg)
87
Example 3)
At the end of the execution
[Figure: Execution Trace Graph and Execution Gantt Chart for actors A to G]
![Page 88: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/88.jpg)
88
The same resulting trace, different execution times
[Figure: Execution Gantt Charts for actors A to G in Example 2 (infinite buffer size) and Example 3 (minimal buffer size), next to the shared Execution Trace Graph of the network A (Generator), B to F (CopyToken), G (Consumer)]
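The firing steps of Examples 2 and 3 can be reproduced with a small step-based simulation. This is a sketch, not the TURNUS tool: the edge list is inferred from the firing steps on the slides (A feeds B and D, B feeds C, D feeds E, E feeds F, C and F feed G), every actor is assumed to sit on its own processing component, and an actor fires in a step only if, at the start of that step, all its inputs hold a token and none of its output buffers is full.

```python
# Edges inferred from the firing steps on the slides (an assumption).
EDGES = [("A", "B"), ("A", "D"), ("B", "C"), ("D", "E"),
         ("E", "F"), ("C", "G"), ("F", "G")]
ACTORS = "ABCDEFG"


def simulate(capacity=None, tokens=3):
    """Return the list of firing steps; capacity=None means unbounded buffers."""
    fill = {edge: 0 for edge in EDGES}      # tokens currently in each buffer
    fired = {actor: 0 for actor in ACTORS}  # firings per actor so far
    steps = []
    while True:
        ready = []
        for actor in ACTORS:
            inputs = [e for e in EDGES if e[1] == actor]
            outputs = [e for e in EDGES if e[0] == actor]
            if actor == "A" and fired[actor] >= tokens:
                continue  # Generator guard: counter < 3
            if any(fill[e] == 0 for e in inputs):
                continue  # a token is missing on some input
            if capacity is not None and any(fill[e] >= capacity for e in outputs):
                continue  # FIFO full on some output
            ready.append(actor)
        if not ready:
            return steps
        # All ready actors fire in the same step.
        for actor in ready:
            for edge in EDGES:
                if edge[1] == actor:
                    fill[edge] -= 1
                if edge[0] == actor:
                    fill[edge] += 1
            fired[actor] += 1
        steps.append([f"{actor}:{fired[actor]}" for actor in ready])


unbounded = simulate()          # Example 2 configuration
minimal = simulate(capacity=1)  # Example 3 configuration
print(len(unbounded), len(minimal))  # prints "7 9"
```

Both runs contain exactly the same firings; only the number of steps differs, which is the point of the comparison.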
![Page 89: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/89.jpg)
89
Execution trace graph
Dataflow network: MPEG4-SP decoder, 17 actors, 38 buffers
ETG for a few frames of input stimulus: 176 649 nodes, 1 609 543 arcs
• Structural model of a dynamic execution
• Abstracts from time
• Model for Design Space Exploration
• Generated for a specific input stimulus
• Carries the information about the potential parallelism
![Page 90: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/90.jpg)
90
Performance Estimation: Execution Trace Post-Processing
• For each fired action (i.e. each node of the execution trace), its Computational Load is measured
• The dependency set is extended according to the partitioning/scheduling configuration
Weighted Execution Trace
time
Vi
Vi Computational Load
write/read delay
action selection execution
The Execution Trace Performance Estimation
![Page 91: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/91.jpg)
91
Execution Trace Critical Path
The Critical Path Definition
Critical Path Algorithm, O(n)
Determines the design bottlenecks to address in order to improve the final throughput:
• Most Critical Actors and Actions
• Most Critical Buffers (only for bounded buffer configuration analysis)
Definition (critical path): the longest time-weighted sequence of events from the start of the program to its termination.
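One common way to compute such a longest weighted path in a single pass is sketched below. It assumes the trace nodes are listed in firing order, so every dependency appears before the node that needs it; the weights and node names are made-up values for one round of the A to G network, not measurements, and this is not claimed to be the exact TURNUS algorithm.

```python
def critical_path(weights, arcs):
    """Longest weighted path in a DAG whose nodes are given in topological order."""
    predecessors = {node: [] for node in weights}
    for source, target in arcs:
        predecessors[target].append(source)
    finish, back = {}, {}
    for node in weights:  # dict order is assumed to be the firing order
        previous = max(predecessors[node], key=finish.get, default=None)
        finish[node] = weights[node] + (finish[previous] if previous is not None else 0)
        back[node] = previous
    node = max(finish, key=finish.get)
    length, path = finish[node], []
    while node is not None:
        path.append(node)
        node = back[node]
    return length, path[::-1]


# Hypothetical computational loads for one firing of each actor.
weights = {"A1": 2, "B1": 1, "D1": 1, "C1": 1, "E1": 4, "F1": 1, "G1": 2}
arcs = [("A1", "B1"), ("A1", "D1"), ("B1", "C1"), ("D1", "E1"),
        ("E1", "F1"), ("C1", "G1"), ("F1", "G1")]
print(critical_path(weights, arcs))
```

Each node and each arc is visited once, so the cost is linear in the size of the trace.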
![Page 92: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/92.jpg)
92
Critical Path Metrics
The Critical Path Metrics and Analyses
Critical Path Analysis Techniques:
• Critical Actions/Actors Ranking
• Impact Analysis
• Critical Path Gradient (i.e. refactoring directions)
NOTE: these metrics are used to estimate the design improvements that could be obtained by reducing the algorithmic complexity of an action.
timeVi
Vi Computational Load
write delay
action selection execution
![Page 93: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/93.jpg)
93
The Co-Exploration Framework
TURNUS Functionality Overview
[Figure: TURNUS co-exploration framework. Blocks: Dataflow Design Exploration; CAL Profiler; Causation Trace Analysis and Post-Processing (Heuristics), with Computational Load, Critical Path Measurement, Buffer Size Min. & Optim., and MCD Partitioning; SW and HDL Code Synthesis (e.g. C/C++, C STHorm, Xronos HDL); CAL Dataflow Program; Architecture Model; Design Mapping (Partitioning, Scheduling, Buffer Dimensioning); Cycle-Accurate Profiling Information; SW-HW Implementation; Causation Trace.]
![Page 94: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/94.jpg)
94
Experimental Results: Design Case
CP Analyses Scenario
Application: MPEG4-SP decoder, RVC-CAL
Platforms: STMicroelectronics STHorm, Virtex-5 FPGA
Input Sequences: 4 different 10-frame QCIF bit streams (Akiyo, Foreman, Suzie, News)
Execution Trace:
Steps: ≈ 10^6
Dependencies: ≈ 10^7
![Page 95: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/95.jpg)
95
Critical Path Metrics
CP Analyses Actors and Actions Ranking
Critical Actions Ranking:
• Actions (and actors) are ranked by their Critical Path Contribution (CPC)
96
Critical Path Metrics
CP Analyses Impact Analysis
Impact Analysis: Which actions need attention?
Most Critical Action
different input stimulus
97
Critical Path Metrics
CP Analyses Impact Analysis
Impact Analysis: Which actions need attention?
The maximum CL reduction that still reduces the CP length when optimizing only this action
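One way to read this metric, sketched under the assumption that an optimization scales the weight of every firing of a single action by a constant factor. The trace, action names and numbers are hypothetical. Once the action has been sped up enough, another path becomes critical and further reductions of its computational load (CL) no longer shorten the CP.

```python
from collections import defaultdict

# Hypothetical trace: firing id -> (action, weight); ids in topological order.
FIRINGS = {0: ("a", 2.0), 1: ("b", 6.0), 2: ("c", 4.0), 3: ("d", 1.0)}
DEPS = [(0, 1), (0, 2), (1, 3), (2, 3)]


def cp_length(firings, deps, action=None, scale=1.0):
    """Critical path length when every firing of `action` is scaled by `scale`."""
    preds = defaultdict(list)
    for src, dst in deps:
        preds[dst].append(src)
    finish = {}
    for node in sorted(firings):
        name, weight = firings[node]
        if name == action:
            weight *= scale
        start = max((finish[p] for p in preds[node]), default=0.0)
        finish[node] = start + weight
    return max(finish.values())


def impact(firings, deps, action, reductions=(0.25, 0.5, 0.75, 1.0)):
    """Map each CL reduction of `action` to the resulting CP reduction."""
    base = cp_length(firings, deps)
    return {r: 1.0 - cp_length(firings, deps, action, 1.0 - r) / base
            for r in reductions}


for reduction, gain in impact(FIRINGS, DEPS, "b").items():
    print(f"CL of b reduced by {reduction:.0%} -> CP reduced by {gain:.1%}")
```

In this toy trace the CP gain stops growing once the load of `b` has been halved, because the path through `c` then becomes critical.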
98
Buffer Dimensioning Problem: Formulation
Buffer Dimensioning Formulation
Objective: find a trade-off between the buffer size and the critical path length
Minimum buffer size configuration for a given throughput
[Figure: critical path length (y axis) vs. buffer size (x axis). Labels: deadlock region, min. buffer size, CP length with min. buffer size, CP length with unbounded buffer size]
99
Heuristic Approach
Buffer Dimensioning Heuristics
Divide the problem into two sub-problems:
1. Compute the minimal buffer size
2. Optimize only the critical buffers (i.e. the buffers that block actions in the current CP)
[Figure: critical path length (y axis) vs. buffer size (x axis). Labels: deadlock region, min. buffer size, CP length with min. buffer size, CP length with unbounded buffer size]
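The second sub-problem can be sketched as a greedy loop. This is an illustration under stated assumptions, not the TURNUS implementation: `dimension_buffers`, `toy_estimate`, the buffer names and all numbers are hypothetical, and the toy estimator merely stands in for a trace-based critical path analysis.

```python
def dimension_buffers(min_sizes, estimate, max_total):
    """Greedy sketch: grow only the most critical buffer at each step.

    `estimate(sizes)` must return (cp_length, blocking), where `blocking`
    maps each buffer to the blocking time it causes along the current CP.
    """
    sizes = dict(min_sizes)
    cp, blocking = estimate(sizes)
    while sum(sizes.values()) < max_total:
        critical = max(blocking, key=blocking.get)
        if blocking[critical] <= 0:
            break  # no buffer blocks the CP any more
        sizes[critical] += 1
        cp, blocking = estimate(sizes)
    return sizes, cp


# Toy stand-in for a trace-based estimator (an assumption, not TURNUS):
# a buffer smaller than its "demand" adds a fixed delay per missing token.
DEMAND = {"parser->idct": 4, "idct->add": 2, "add->display": 1}
DELAY = {"parser->idct": 3.0, "idct->add": 5.0, "add->display": 1.0}


def toy_estimate(sizes):
    blocking = {b: max(0, DEMAND[b] - sizes[b]) * DELAY[b] for b in sizes}
    return 100.0 + sum(blocking.values()), blocking


# Sub-problem 1 is assumed to be solved already: these are the minimal sizes.
minimal = {"parser->idct": 1, "idct->add": 1, "add->display": 1}
sizes, cp = dimension_buffers(minimal, toy_estimate, max_total=8)
print(sizes, cp)
```

The `max_total` budget expresses the trade-off: a smaller budget stops the loop earlier and leaves a longer critical path.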
100
Critical Path Analysis
Buffer Dimensioning
Highlight the Critical Buffer: the one that introduces the highest total blocking latency along the Critical Path
Highlight Critical Buffers along the Critical Path
[Figure: CP analysis timeline of firing Vi. Labels: time, Vi computational load, write delay, action selection, execution]
101
TURNUS Estimated Results
reasonable buffer size configuration
the CP length is not significantly reduced
Buffer Dimensioning Estimated Results
102
Estimated Results vs. Implementation Results
Buffer Dimensioning Estimated vs Implementation
103
Design case: MPEG-4 AVC/H.264 decoder
• Initial performance summary
Design space exploration
MPEG-4 video decoders
104
MPEG-4 AVC/H.264 decoder
• Decoder_Y design points summary
105
MPEG-4 AVC/H.264 decoder
106
Case study-2: MPEG-4 AVC/H.264 decoder (cont.)
• Decoder_Y Pareto analysis
• Throughput optimal design points for resource (slice LUT) and power (frequency)
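A minimal sketch of a Pareto filter over design points, assuming two criteria only: throughput (higher is better) and slice LUTs (lower is better). The `DesignPoint` type and all numbers are invented for illustration and are not the results reported on the slides.

```python
from typing import NamedTuple


class DesignPoint(NamedTuple):
    name: str
    throughput: float  # e.g. frames per second (higher is better)
    slice_luts: int    # resource cost (lower is better)


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = a.throughput >= b.throughput and a.slice_luts <= b.slice_luts
    better = a.throughput > b.throughput or a.slice_luts < b.slice_luts
    return no_worse and better


def pareto_front(points):
    """Keep the design points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented numbers, for illustration only.
POINTS = [
    DesignPoint("dp1", 10.0, 20000),
    DesignPoint("dp2", 25.0, 26000),
    DesignPoint("dp3", 24.0, 30000),
    DesignPoint("dp4", 40.0, 41000),
    DesignPoint("dp5", 12.0, 20000),
]

for p in sorted(pareto_front(POINTS), key=lambda p: p.slice_luts):
    print(p.name, p.throughput, p.slice_luts)
```

A power criterion (for example clock frequency) could be added as a third field with the same dominance test extended to it.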
107
Case study-2: MPEG-4 AVC/H.264 decoder (cont.)
• Design points comparison (original vs best)
• ~18% code change for ~13x throughput gain
• Less than ~5% code rewriting, essentially “cut and paste”
108
Case study-2: MPEG-4 AVC/H.264 decoder (cont.)
• Comparison with related works
• Compares favorably with similar works
• Rapidly obtains design alternatives in the exploration space
[58] FastVdo http://fastvdo.com/FV264. FV264-H.264/AVC ASIC IP CORE.
[59] CoreEL Technologies http://www.coreel.com/pages/productsDigitalVideoH264CBPDecode.aspx, H.264 CBP Decoder.
[114] M. Thadani et al. “ESL flow for a hardware H.264/AVC decoder using TLM-2.0 and high level synthesis: a quantitative study.” In Proceedings of SPIE - The International Society for Optical Engineering, 2009.
[119] S. Wang et al. “A software-hardware co-implementation of MPEG-4 Advanced Video Coding (AVC) decoder with block level pipelining.” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 2005.
109
Multi Clock Domain Partitioning
Introduction MCD Partitioning
Objective: partition the dataflow application onto an MCD architecture
• Reduce the overall power consumption
• Without impacting the (nominal) performance
[Figure: partitioning example, with actors assigned to clock domains ck1, ck2 and ck3]
110
Multi Clock Domain Partitioning
MCD Partitioning Problem Formulation
Objective: partition the dataflow application onto an MCD architecture
• Reduce the overall power consumption
• Without impacting the (nominal) performance
Find a mapping configuration Actor → Clock Domain (i.e. clock frequency) such that power is reduced and performance is preserved:
• we simplify the power-frequency relation with a linear function
• we define the performance constraint in terms of Critical Path Length
111
Multi Clock Domain Partitioning
MCD Partitioning Analytical
Idea: slow down only actors outside the critical path
• The critical path length defines the longest serial part of the program execution
• The execution time of an action is directly related to its clock frequency
Stretch the graph so that all parallel paths have the same length
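The "stretching" idea can be illustrated with a classic slack computation (forward and backward pass over the dependency graph). This is only a sketch on a hypothetical four-node graph; node names and times are invented, and the slides do not state that the tool computes slack this way.

```python
from collections import defaultdict

# Hypothetical firing graph: node -> execution time at the nominal clock.
TIME = {"src": 1.0, "fast": 2.0, "slow": 8.0, "sink": 1.0}
DEPS = [("src", "fast"), ("src", "slow"), ("fast", "sink"), ("slow", "sink")]
ORDER = ["src", "fast", "slow", "sink"]  # a topological order


def slack(time, deps, order):
    """Per-node slack: how far a node may be stretched without lengthening the CP."""
    preds, succs = defaultdict(list), defaultdict(list)
    for a, b in deps:
        preds[b].append(a)
        succs[a].append(b)
    earliest_finish = {}
    for n in order:  # forward pass
        start = max((earliest_finish[p] for p in preds[n]), default=0.0)
        earliest_finish[n] = start + time[n]
    cp = max(earliest_finish.values())
    latest_finish = {}
    for n in reversed(order):  # backward pass
        latest_finish[n] = min(
            (latest_finish[s] - time[s] for s in succs[n]), default=cp
        )
    return {n: latest_finish[n] - earliest_finish[n] for n in order}, cp


slacks, cp = slack(TIME, DEPS, ORDER)
for n in ORDER:
    factor = (TIME[n] + slacks[n]) / TIME[n]
    where = "along the CP" if slacks[n] == 0 else "outside the CP"
    print(f"{n}: {where}, may run up to {factor:.1f}x slower")
```

Note that the slack of several nodes on the same non-critical path is shared, so the per-node factor is an upper bound for a node taken in isolation, not a jointly valid assignment.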
112
Multi Clock Domain Partitioning
MCD Partitioning Why Dataflow?
Different kinds of Arcs:
• Action firing dependencies
• Action firings inside the critical path
• Action firings outside the critical path: the slow-down factor is a free variable (i.e. extend the arc’s length)
Outside the CP
Along the CP
113
Multi Clock Domain Partitioning
MCD Partitioning Analytical
Analytical Problem Formulation: (LP-formulation)
Mapping Function (Actor → Clock Domain):
Critical Path
Outside the Critical Path
114
Multi Clock Domain Partitioning
MCD Partitioning Heuristic
Heuristic Approach (sub-optimal!): iteratively try to slow down the clocks of actors outside the critical path
If slowing down the clock of an actor outside the critical path does not change the performance, this becomes its new (temporary) clock domain
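A greedy sketch of this heuristic, under several assumptions: the available frequencies, actor names, cycle counts and trace are hypothetical, and performance is measured as the critical path length of the toy trace. The sketch tries every actor; actors on the critical path are rejected automatically because slowing them down lengthens it.

```python
from collections import defaultdict

FREQS = [100.0, 50.0, 25.0]  # available clock domains in MHz (hypothetical)
# Hypothetical trace: firing -> (actor, clock cycles); ids in topological order.
FIRINGS = {0: ("parser", 100), 1: ("texture", 800),
           2: ("motion", 150), 3: ("merge", 100)}
DEPS = [(0, 1), (0, 2), (1, 3), (2, 3)]


def cp_length(mapping):
    """Critical path length (microseconds) for an actor -> frequency mapping."""
    preds = defaultdict(list)
    for src, dst in DEPS:
        preds[dst].append(src)
    finish = {}
    for node in sorted(FIRINGS):
        actor, cycles = FIRINGS[node]
        start = max((finish[p] for p in preds[node]), default=0.0)
        finish[node] = start + cycles / mapping[actor]
    return max(finish.values())


def partition():
    """Lower one actor's clock at a time, keeping the change if the CP is unchanged."""
    mapping = {actor: FREQS[0] for actor, _ in FIRINGS.values()}
    nominal = cp_length(mapping)
    changed = True
    while changed:
        changed = False
        for actor in list(mapping):
            level = FREQS.index(mapping[actor])
            if level + 1 == len(FREQS):
                continue  # already in the slowest domain
            trial = dict(mapping)
            trial[actor] = FREQS[level + 1]
            if cp_length(trial) <= nominal:
                mapping, changed = trial, True
    return mapping, nominal


mapping, nominal = partition()
print(mapping, "CP length:", nominal)
# With a linear power-frequency model, the sum of frequencies is a power proxy.
print("power proxy:", sum(mapping.values()), "vs nominal", FREQS[0] * len(mapping))
```

In this toy case only `motion` is moved to a slower domain; the result depends on the order in which actors are tried, which is one reason such a heuristic is sub-optimal in general.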
115
Experimental Results: MPEG4-SP decoder
Design Case MPEG4-SP CAL Design
Application: MPEG4-SP decoder, 41 RVC-CAL actors
Platforms: Virtex-5 FPGA
Execution Trace: Steps ≈ 10^6, Dependencies ≈ 10^7
116
Experimental Results: MPEG4-SP decoder
Design Case Results
Power Reduction, compared to nominal performance (i.e. all actors mapped in the clock domain with the highest frequency)
117
Dataflow programs: design exploration problem
• Buffer dimensioning
• Partitioning
• Scheduling
Conclusions
119
Conclusions
Conclusion
• Dataflow (stream) programming provides the opportunity for a unified high-level system design for both SW and HW implementations on heterogeneous platforms
• Exploring the VERY LARGE design space (dataflow stream architecture, HW/SW partitioning, buffer dimensioning, SW core partitioning, scheduling, clock assignment, …) and optimizing parameters according to design criteria is a MUST!
• Very interesting results and design optimizations can be obtained for dynamic dataflow programs by representing and exploring the execution trace
• Progress in tool support and further optimizations are still under investigation: language extensions, design using mixed MoC, automatic support for dataflow refactoring, advanced scheduling, joint optimization heuristics, …
![Page 120: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/120.jpg)
120
Bibliography (1)Simone Casale-Brunet, Małgorzata Michalska, Endri Bezati, Junaid Jameel Ahmad, Marco Mattavelli: “Profiling and high-precision performance estimation of dynamic dataflow programs on multi-core platforms” in “Modeling, Optimization and Simulation for High Performance and Ultra-Low Power Circuits and Systems”, Integration, the VLSI Journal.[To appear in 2018].
Simone Casale-Brunet and M. Mattavelli: “Execution trace graph of dataflow process networks”, IEEE Transactions on Multi-Scale Computing Systems, Volume: PP, Issue: 99, January 2018, Print ISSN: 2332-7766, DOI: 10.1109/TMSCS.2018.2790921
M. Michalska, S. Casale-Brunet , E. Bezati, and M. Mattavelli: “High-Precision Performance Estimation for the Design Space Exploration of Dynamic Dataflow Programs”, IEEE Transactions on Multi-Scale Computing Systems, Volume: PP, Issue: 99, November 2017, Print ISSN: 2332-7766, Digital Object Identifier: 10.1109/TMSCS.2017.2774294.
Casale-Brunet, S; Mattavelli, M.; Janneck, J.W., "Buffer optimization based on critical path analysis of a dataflow program design," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, vol., no., pp.1384,1387, Beijing, China, 19-23 May 2013, doi: 10.1109/ISCAS.2013.6572113
Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "TURNUS: A design exploration framework for dataflow system design," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, vol., no., pp.654,654, Beijing, China, 19-23 May 2013 doi: 10.1109/ISCAS.2013.6571927
- Małgorzata Michalska, Nicolas Zufferey and Marco Mattavelli: "Performance Estimation Based Multicriteria Partitioning Approach for Dynamic Dataflow Programs," Journal of Electrical and Computer Engineering, vol. 2016, Article ID 8536432, 15 pages, 2016. doi:10.1155/2016/8536432, http://www.hindawi.com/journals/jece/2016/8536432/ .
- M. Canale, S. Casale-Brunet, E. Bezati, M. Mattavelli, J. Janneck: "Dataflow Programs Analysis and Optimization Using Model Predictive Control Techniques", Journal of Signal Processing Systems, 2016, Vol: 84, No. 3, Pages 371--381, issn="1939-8115", Doi: 10.1007/s11265-015-1083-4, http://dx.doi.org/10.1007/s11265-015-1083-4 http://link.springer.com/article/10.1007/s11265-015-1083-4
- E. Bezati; S. Casale Brunet; M. Mattavelli; J. W. Janneck, "Clock-gating of streaming applications for energy efficient implementations on FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , vol. PP, Issue no.99, pp.1-1, Aug. 2016, doi: 10.1109/TCAD.2016.2597215.
- K. Jerbi, D. Renzi, D. De Saint Jorre, H. Yviquel, M. Raulet, C. Alberti, M. Mattavelli: “Development and optimization of high level dataflow programs: the HEVC decoder design case”, Journal of Signal Processing Systems, 2016.http://link.springer.com/article/10.1007/s11265-016-1113-x
![Page 121: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/121.jpg)
121
Bibliography (2)Anatoly Prihozhy, Endri Bezati, Ab Al-Hadi Ab Rahman, Marco Mattavelli: "Synthesis and Optimization of Pipelines for HW Implementations of Dataflow Programs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, October 2015, Pag. 1613-1626.
Jani Boutellier, Johan Ersfollk, Johan Lilius, Marco Mattavelli, Ghislain Roquier, Olli Silven: "Actor Merging for Dataflow Process Networks" IEEE Transactions on Signal Processing, Vol. 63, No. 10, May 15, 2015, Pag. 2496-2508.
C. Sau, P. Meloni, L. Raffo, F. Palumbo, E. Bezati, S. Casale Brunet and M. Mattavelli, “Automated Design Flow for Multi-Functional Dataflow-Based Platforms,” in Journal of Signal Processing Systems -Signal Image and Video Technology-, p. 1-23, 2015.http://link.springer.com/article/10.1007/s11265-015-1026-0
C. Massimo, S. Casale Brunet, E. Bezati, M. Mattavelli and J. Janneck, “Dataflow Programs Analysis and Optimization Using Model Predictive Control Techniques. Two Examples of Bounded Buffer Scheduling: Deadlock Avoidance and Deadlock Recovery Strategies,” in Journal of Signal Processing Systems, p. 1-11, 2015. Doi: 10.1007/s11265-015-1083-4.http://link.springer.com/article/10.1007%2Fs11265-015-1083-4.
M. Michalska, E. Bezati, S. Casale-Brunet and M. Mattavelli, "Buffer dimensioning for throughput improvement of dynamic dataflow signal processing applications on multi-core platforms," 2017 25th European Signal Processing Conference (EUSIPCO), Kos, 2017, pp. 1339-1343, DOI: 10.23919/EUSIPCO.2017.8081426
S. Casale-Brunet, E. Bezati and M. Mattavelli, "Design space exploration of dataflow-based Smith-Waterman FPGA implementations," 2017 IEEE International Workshop on Signal Processing Systems (SiPS), Lorient, France, 2017, pp. 1-6, DOI: 10.1109/SiPS.2017.8109982
S. Casale-Brunet, E. Bezati, A. Bloch and M. Mattavelli, "High-level system synthesis and optimization of dataflow programs for MPSoCs”, "2017 51th Asilomar Conference on Signals, Systems and Computers”, Pacific Grove, CA, 2017.
E. Bezati, S. Casale-Brunet and M. Mattavelli, "Execution Trace Graph based interface synthesis of signal processing dataflow programs for heterogeneous MPSoCs”, "2017 51th Asilomar Conference on Signals, Systems and Computers", Pacific Grove, CA, 2017
E. Bezati, S. Casale Brunet, M. Mattavelli, J. W. Janneck: "High-level system synthesis and optimization of dataflow programs for MPSoCs", in 2016 50th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA, November 6-9, 2016.
E. Bezati; S. Casale Brunet; M. Mattavelli; J. W. Janneck, "High-Level Synthesis of Dynamic Dataflow Programs on heterogeneous MPSoC platforms", in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS).
![Page 122: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/122.jpg)
122
Bibliography (3)Simone Casale-Brunet, Endri Bezati, Marco Mattavelli: “Programming Models and Methods for Heterogeneous Parallel Embedded Systems”, 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Lyon, France, September 21-23 2016.
Simone Casale-Brunet, Endri Bezati, Marco Mattavelli: “High Level Synthesis of Smith-Waterman dataflow implementations”, IEEE 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016
M. Michalska, S. Casale Brunet, E. Bezati, M. Mattavelli, J. Janneck. Trace-Based Manycore Partitioning of Stream-Processing Applications. 50th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA, November 6-9, 2016.
M. Michalska, S. Casale Brunet, E. Bezati, M. Mattavelli: “High-precision performance estimation of dynamic dataflow programs”. IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, IEEE MCSoC 2016, Lyon, France, September 21-23 2016.
M. Michalska, N. Zufferey, E. Bezati, M. Mattavelli: “Design space exploration problem formulation for dataflow programs on heterogeneous architectures”. IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, IEEE MCSoC2016, Lyon, France, September 21-23 2016.
C. Alberti, N. Daniels, M. Hernaez, J. Voges, R. L. Goldfeder, A. A. Hernandez-Lopez, M. Mattavelli, B. Berger: “An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values”, IEEE 2016 Data Compression Conference, Snowbird, Utah, USA, March 27-29, 2016, DOI 10.1109/DCC.2016.39.
M. Michalska, J.J. Ahmad, E. Bezati, S. Casale-Brunet, M. Mattavelli: “Performance Estimation of Program Partitions on Multi-core Platforms”. International Workshop on Power and Timing Modeling, Optimization and Simulation, Bremen, Germany, September 21-23 2016.
M. Michalska, N. Zufferey, M. Mattavelli: “Tabu Search for Partitioning Dynamic Dataflow Programs”. International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA, Procedia Computer Science, Volume 80, Pages 1577-1588, 2016.
M. Michalska, E. Bezati, S. Casale-Brunet, M. Mattavelli: “A Partition Scheduler Model for Dynamic Dataflow Programs”. International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA, Procedia Computer Science, Volume 80, Pages 2287–2291, 2016.
![Page 123: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/123.jpg)
123
Bibliography (4)M. Michalska, N. Zufferey, J. Boutellier, E. Bezati and M. Mattavelli. Efficient scheduling policies for dataflow programs executed on multi-core. Proceedings of the 9th International Workshop on Programmability and Architectures for Heterogeneous Multicore (MULTIPROG), Prague, Czech Republic, January 18, 2016.
M. Michalska, J. Boutellier and M. Mattavelli: “A methodology for Profiling and Partitioning Stream Programs on Many-core Architectures: ”International Conference on Computational Science (ICCS), Reykjavik, Iceland, June 1-3, 2015, Procedia Computer Science, Vol 51, pag. 2962-2966. Doi: 10.1016/j.procs.2015.05.498.http://www.sciencedirect.com/science/article/pii/S187705091501306X
M. Michalska, S. Casale-Brunet, E. Bezati and M. Mattavelli: “Execution Trace Graph Based Multi-criteria Partitioning of Stream Programs: ”International Conference on Computational Science (ICCS), Reykjavik, Iceland, June 1-3, 2015, Procedia Computer Science, Vol 51, pag. 1443-1452. Doi: 10.1016/j.procs.2015.05.334http://www.sciencedirect.com/science/article/pii/S1877050915011424
- Canale, M.; Casale-Brunet, S.; Bezati, E.; Mattavelli, M.; Janneck, J.W., “Dataflow Programs Analysis and Optimization using Model Predictive Control Techniques: an example of bounded buffer scheduling”, SiPS, 2014, Belfast, North Ireland.http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6986054DOI: 10.1109/SiPS.2014.6986054
- Casale-Brunet, S.; Bezati, E.; Mattavelli, M.; Canale, M.; Janneck, J.W., “Execution trace graph analysis of dataflow programs: bounded buffer scheduling and deadlock recovery using model predictive control”, Conference on Design and Architectures for Signal and Image Processing (DASIP 2014), October 2014, Madrid, Spain. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7115623DOI: 10.1109/DASIP.2014.7115623
- S. Casale Brunet, M. Michalska, E. Bezati, M. Mattavelli, J. Janneck and M. Canale: “TURNUS: an open-source design space exploration framework for dynamic stream programs,”, Conference on Design and Architectures for Signal and Image Processing (DASIP), October2014, Madrid, Spain.http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7115614DOI: 10.1109/DASIP.2014.7115614
- ICIP 2014 Damien de Saint Jorre, Claudio Alberti, Marco Mattavelli, Simone Casale-Brunet "Exploring MPEG HEVC decoder parallelism for the efficient porting onto many-core platforms" 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27-30 October 2014, p. 2115-2119, doi:10.1109/ICIP.2014.7025424.
![Page 124: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/124.jpg)
124
Bibliography (5)E. Bezati, S. Casale-Brunet, M. Mattavelli, J.W. Janneck: " Coarse Grain Clock Gating of Streaming Applications In Programmable Logic Implementations," Proceedings Of The 2014 Electronic System Level Synthesis Conference (Eslsyn), 2014, San Francisco, CA, USA, Doi:10.1109/ESLsyn.2014.6850387, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6850387.
C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, M. Mattavelli: “Automated Design Flow for Coarse-Grained Reconfigurable Platforms: an RVC-CAL Multi-Standard Decoder Use-Case”, IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014, Agios Konstantinos, Greece,DOI: 10.1109/SAMOS.2014.6893195http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6893195.
J.W. Janneck., S. Casale-Brunet, M. Mattavelli: “Characterizing Communication Behaviour of Dataflow Programs Using Trace Analysis”, IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014, AgiosKonstantinos, Greece, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6893193DOI: 10.1109/SAMOS.2014.6893193.
Rahman, A.A.A, Casale-Brunet S., Alberti C., Mattavelli M., “A Methodology for Optimizing Buffer Sizes of Dynamic Dataflow FPGAs Implementations”, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2014), Firenze, Italy,http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6854554DOI: 10.1109/ICASSP.2014.6854554
Ersfolk, J.; Roquier, G.; Lund, W.; Mattavelli, M. & Lilius, J., “Static and Quasi-static Compositions of Stream Processing Applications from Dynamic Dataflow Programs”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013.
Bezati, E.; Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "Synthesis and optimization of high-level stream programs," Electronic System Level Synthesis Conference (ESLsyn), 2013 , pp.1-6, Austin, Texas, USA, May 31-June 1, 2013.
Bezati, E., Thavot, R., Roquier, G., & Mattavelli, M. High-level dataflow design of signal processing systems for reconfigurable and multicore heterogeneous platforms. Journal of Real-Time Image Processing, 2013, doi:10.1007/s11554-013-0326-5.
Bezati, E.; Roquier, G.; Mattavelli, M., "Live demonstration: High level software and hardware synthesis of dataflow programs," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on , vol., no., pp.660,660, Beijing, China, 19-23 May 2013, doi: 10.1109/ISCAS.2013.6571930.
![Page 125: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/125.jpg)
125
Bibliography (6)Casale-Brunet, S.; Mattavelli, M.; Alberti, C.; Janneck, J.W., "Representing Guard Dependencies in Dataflow Execution Traces," Computational Intelligence, Communication Systems and Networks (CICSyN), 2013 Fifth International Conference on , vol., no., pp.291,295, Madrid, Spain, 5-7 June 2013, doi: 10.1109/CICSYN.2013.26,
Casale-Brunet, S.; Mattavelli, M.; Alberti, C.; Janneck, J.W., “Design Space Exploration of High Level Stream Programs on Parallel Architectures: A focus on the Buffer Size Minimization and Optimization Problem,” 8th International Symposium on Image and Signal Processing and Analysis (ISPA 2013), Trieste, Italy, 4-6 September 2013
Casale-Brunet, S.; Alberti, C.; Mattavelli, M.; Janneck, J.W., “TURNUS: a unified dataflow design space exploration framework for heterogeneous parallel systems,“ Conference on Design & Architectures for Signal & Image Processing (DASIP), Cagliari, Italy, 8-10 October 2013
Casale-Brunet, S.; Bezati, E.; Roquier, G., Alberti, C.; Mattavelli, M.; Janneck, J.W.; Boutellier, J., “Design Space Exploration and Implementation of RVC-CAL Applications using the TURNUS framework,“ Conference on Design & Architectures for Signal & Image Processing (DASIP), Cagliari, Italy, 8-10 October 2013
Casale-Brunet, S.; Bezati, E.; Alberti, C.; Mattavelli, M.; Amaldi, E.; Janneck, J.W., " Partitioning and Optimization of high level Stream applications for Multi Clock Domain Architectures," Signal Processing Systems (SiPS), 2013 IEEE Workshop on, Taipei, Taiwan, 16-18 October 2013
Casale-Brunet, S.; Bezati, E.; Alberti, C.; Mattavelli, M.; Amaldi, E.; Janneck, J.W., “Multi-clock domain optimization for reconfigurable architectures in high-level dataflow applications,” Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 3-6 November 2013
Casale-Brunet, S.; Alberti, C.; Mattavelli, M.; Janneck, J.W., “Systems Design Space Exploration by Serial Dataflow Program Executions,” Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 3-6 November 2013
Casale-Brunet, S.; Elguindy, A.; Bezati, E.; Thavot, R.; Roquier, G., Mattavelli, M.; Janneck, J.W., “Methods to explore design space for MPEG RMC codec specifications,” Signal Processing: Image Communication, 2013, ISSN 0923-5965, DOI: 10.1016/j.image.2013.08.012.
Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "Profiling of Dataflow Programs Using Post Mortem Causation Traces," Signal Processing Systems (SiPS), 2012 IEEE Workshop on , vol., no., pp.220,225, Québec City, Canada, 17-19 October 2012, doi: 10.1109/SiPS.2012.54,
![Page 126: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable](https://reader034.fdocuments.net/reader034/viewer/2022042023/5e7a9ce74a345d3f1d1fcb22/html5/thumbnails/126.jpg)
126
Bibliography (7)Bezati, E.; Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "Synthesis and optimization of high-level stream programs," Electronic System Level Synthesis Conference (ESLsyn), 2013 , vol., no., pp.1,6, Austin, Texas, USA, May 31-June 1 2013
Bezati, E., Thavot, R., Roquier, G., & Mattavelli, M. High-level dataflow design of signal processing systems for reconfigurable and multicoreheterogeneous platforms. Journal of Real-Time Image Processing, 2013, doi:10.1007/s11554-013-0326-5
Bezati, E.; Roquier, G.; Mattavelli, M., "Live demonstration: High level software and hardware synthesis of dataflow programs," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, p. 660, Beijing, China, 19-23 May 2013, doi: 10.1109/ISCAS.2013.6571930
Bezati, E.; Mattavelli, M.; Janneck, J.W., “High-Level Synthesis of Dataflow Programs for Signal Processing Systems”, 8th International Symposium on Image and Signal Processing and Analysis (ISPA 2013), Trieste, Italy, 4-6 September 2013
Rahman, A.A.A.; Casale-Brunet, S.; Alberti, C.; Mattavelli, M., “Dataflow Program Analysis and Refactoring Techniques for Design Space Exploration: MPEG-4 AVC/H.264 Decoder Implementation Case Study,” Conference on Design & Architectures for Signal & Image Processing (DASIP), Cagliari, Italy, 8-10 October 2013
Rahman, A.A.A.; Thavot, R.; Casale-Brunet, S.; Bezati, E.; Mattavelli, M., "Design space exploration strategies for FPGA implementation of signal processing systems using CAL dataflow program," Design and Architectures for Signal and Image Processing (DASIP), 2012 Conference on, pp. 1-8, Karlsruhe, Germany, 23-25 October 2012
Ahmad, J.J.; Li, S.; & Mattavelli, M., "Performance Benchmarking of RVC based Multimedia Specifications," Proceedings of the 20th IEEE International Conference on Image Processing (ICIP), Melbourne, Australia, September 2013 © IEEE.
Ahmad, J.J.; Li, S.; Thavot, R.; & Mattavelli, M., "Secure Computing with the MPEG RVC Framework," Signal Processing: Image Communication, August 2013 © Elsevier Science B. V., DOI: 10.1016/j.image.2013.08.015
Ersfolk, J.; Roquier, G.; Lund, W.; Mattavelli, M. & Lilius, J., “Static and Quasi-static Compositions of Stream Processing Applications from Dynamic Dataflow Programs”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013.
M. Mattavelli: "MPEG Reconfigurable Video Representation", in "The MPEG Representation of Digital Media", Chapter 12, Springer Ed. 2011, DOI 10.1007/978-1-4419-6184-6_12. http://www.springer.com/engineering/signals/book/978-1-4419-6183-9 http://www.springerlink.com/content/q7728730l4215455/
M. Mattavelli, M. Raulet, J. Janneck: "MPEG Reconfigurable Video Coding" in "Handbook of Signal Processing Systems", pp. 43-68, S. S. Bhattacharyya, Ed F. Deprettere, R. Leupers and J. Takala Editors, Springer 2010. http://www.springer.com/engineering/signals/book/978-1-4419-6344-4
Bibliography (8)
Ab Al-Hadi Ab Rahman, Anatoly Prihozhy, Marco Mattavelli: "Pipeline Synthesis and Optimization of FPGA-based Video Processing Applications with CAL", Eurasip Journal on Image and Video Processing, 2011, 2011:19, http://jivp.eurasipjournals.com/content/2011/1/19.
Marco Mattavelli, Ihab Amer and Mickael Raulet: "The reconfigurable video coding standard", IEEE Signal Processing Magazine, May 2010, Vol. 27, No. 3, pp. 159-167.
Shuvra S. Bhattacharyya, Gordon Brebner, Johan Eker, Jorn W. Janneck, Marco Mattavelli, Carl von Platen, and Mickael Raulet: “OpenDF - A Dataflow Toolset for Reconfigurable Hardware and Multicore Systems”, ACM SIGARCH Computer Architecture News, Special Issue: MCC08 - Multicore Computing 2008, Volume 36, Number 5, pp. 29-35, December 2008.
Ihab Amer, Christophe Lucarz, Marco Mattavelli, Ghislain Roquier, Mickael Raulet, Olivier Deforges, Jean-Francois Nezan: "Reconfigurable Video Coding: the Video Coding Standard for Multi-core Platforms", IEEE Signal Processing Magazine, Special issue on Multicore Platforms, Vol. 26, No. 6, November 2009, pp. 113-123.
Mickael Raulet, Marco Mattavelli, Jorn Janneck: "Guest editorial: Special Issue on Reconfigurable Video Coding", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0418-4, Link - http://www.springerlink.com/content/k0157333g2602l55/.
Christophe Lucarz, Jonathan Piat, Marco Mattavelli: "Automatic synthesis of parsers and validation of bitstreams within the MPEG Reconfigurable Video Coding Framework", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0395-7, Link - http://www.springerlink.com/content/gg261p657x483m91.
Hussein K Aman-Allah, Karim M Maarouf, Ehab A Hanna, Ihab M Amer, Marco Mattavelli: "CAL Dataflow Components For an MPEG RVC AVC Baseline Encoder", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0396-6, Link - http://www.springerlink.com/content/j487766m06306127.
Shuvra S. Bhattacharyya, Johan Eker, Jorn W. Janneck, Christophe Lucarz, Marco Mattavelli, Mickael Raulet: "Overview of the MPEG Reconfigurable Video Coding Framework", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0399-3, Link - http://www.springerlink.com/content/24t3610261j1628j.
Conclusion
Credits and thanks to:
• Prof. Jorn Janneck, Prof. Edoardo Amaldi (Lund University, Politecnico di Milano)
• Simone Casale-Brunet, Endri Bezati, Malgorzata Michalska, Claudio Alberti, Ghislain Roquier, Ab Al-Hadi Ab Rahman (EPFL)
• Dave Parlour, Ian Miller (Xilinx)
• Mickael Raulet, Mathieu Wipliez, JF Nezan (INSA)
• Johan Eker, Carl von Platen (Ericsson)
• And many others...
Thanks for your attention!