Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical...

129
August 28, 2018

Transcript of Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical...

Page 1: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

August 28, 2018

Page 2: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Clock and cores

Circuits are still getting larger at a steady pace...

… but speed-up is no longer the result of a faster clock. Instead, we get more parallel devices

Intel desktop processor top model

Page 3: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

The end of scaling

Original data collected and plotted by M. Horowitz, F. Labonte, O.Shacham, K. Olukotun, L. Hammond, and C. Batten. Dotted line extrapolations by C. Moore.C. Moore, Data Processing in ExaScale-ClassComputer Systems, April 2011

Page 4: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Performance = parallelism. Efficiency = locality.

It really is that simple. (Bill Dally)

approximate power in 28nm processorBill Dally

Power efficiency in CMOS technology

Page 5: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Parallelism and locality

Computer architecture

Programming

for(int i=1;i<=N_a;i++){ for(int j=1;j<=N_b;j++){ temp[0] = H[i-1][j-1]+ sim_score(seq_a[i-1],seq_b[j-1]); temp[1] = H[i-1][j]-delta; temp[2] = H[i][j-1]-delta; temp[3] = 0.; H[i][j] = find_array_max(temp,4); switch(ind){ case 0: I_i[i][j] = i-1; I_j[i][j] = j-1; break; case 1: I_i[i][j] = i-1; I_j[i][j] = j; break; case 2: I_i[i][j] = i; I_j[i][j] = j-1; break; case 3: I_i[i][j] = i; I_j[i][j] = j; break; } } }

???

Page 6: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

New Trends for Processing Architectures

Programming Applications on Parallel Machines

Am2045

TILE 64

GEFORGE 9400

PC102

SEAforth40C18

C w/ “iLib”

SOPM (“ajava”) C + VHDL

VentureForth

CUDA

HDL

Virtex

C + VHDL + Open CL

KALRAY BostanMPPA

C / C++ + Open CL

Xilinx Ultrascale + MPSoC

Page 7: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Programming Applications on Sequential Machines

Pentium IV

PDP-11

FujitsuSuperSPARC

68020

“Von Neumann” machineC, Pascal, Forth, Java, Lisp, Fortran, Haskell,...

Page 8: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Programming Concurrent Heterogeneous Machines?

??Processor

1

Multi-core1 to 16

Pentium IV

Intel® Core™ i9-7960X Processor

KALRAY BostanMPPA

NvidiaGEFORCE9400

FPGA1000s ~ 10000s

Xilinx Ultrascale + MPSoC

Many-core1 to 288

Page 9: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Platform(physical parallelism)

1 10 100 1000

1

10

100

1000

Application(explicit parallelism)

parallelizing compilers

Parallelization

Sequentialization

Portable Concurrency: the Need of a Paradigm Shift

Page 10: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Which Features do we need?

For software to scale to future parallel computing platforms as seamlessly as possible, it will be necessary to describe algorithms in a way that:1) exposes as much parallelism of the application as practical,2) provides simple and natural abstractions that help manage the high degree of parallelism and permits principled composition of and interaction between modules,3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on,4) is efficiently implementable on a wide range of computing substrates, including traditional sequential processors, shared-memory multicores, manycore processor arrays, and programmable logic devices, as well as their combinations.

Looking at the above criteria: shared-memory threads fulfill the first requirement, but essentially fail on the other three. hardware description languages such as VHDL and Verilog fulfill the first two criteria, but as they fail on the third point by assuming a particular model of (clocked) time, their implementability is essentially limited to hardware and hardware-like programmable logic (FPGAs).

Page 11: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Problems with sequential computation models

No explicit parallelism is available: lack of expressivity

A deterministic sequence of low level operation is fully specified: no distinction between what is necessary for a correct semantics and what is not (memory bandwidth optimizations are more difficult)

Memory bandwidth does not scale with the number of cores/processors

Page 12: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Why threading does not solve the problem?

Threads are sequential processes that share memory.

Pros:Languages require little or no syntactic changes to support threads, and operating systems and architectures have evolved to efficiently support them.

Cons:They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism.

Page 13: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Threads do not compose

n = 0;

t1 = n;n = t1 + 1;

t2 = n;n = t2 + 1;

Thread race on shared state

Thread 1: Thread 2:

Each thread is determinate.Their composition is not.In the domain of threads, determinacy is not compositional.

Page 14: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Dataflow programs: not exactly a new model …….

digital signal processingmedia processingvideo codingimage processing / analyticsaudionetworking / packet processing...

Page 15: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Data flow schemas (Dennis et al.)

networks of (potentially concurrent) functionstriggered by the availability of input dataexecuting in a sequence of atomic transitionsdescribed by firing rules

Page 16: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Kahn process networks (Kahn)

networks of stream functions… described as communicating tasks with blocking readsprecise criterion for determinacy (prefix-monotonicity)formal fixpoint semantics

Page 17: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Actors (Hewitt et al.)

Stateful, active objects (actors)Communicating via message passingNon-determinism and partial ordersFunctional actors with fixpoint semantics

Page 18: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

More Recent Dataflow Concepts

• computational kernels (aka “actors”)

•directed point-to-point connections

• lossless, order-preserving• asynchronous• conceptually unbounded

(but...)

• communicating discrete data packets (“tokens”)

Page 19: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Actors (& Actions)

Actions

State

• actors encapsulate state• only interact via ports• execution is a sequence of steps• each step executes an “action”• an action can: 1) read tokens 2) write tokens 3) modify actor state

written in CAL(Cal Actor Language)

Page 20: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

CAL a formal language for dataflow: origin and features

The CAL language is a data flow oriented language that has been specified and developed as subproject of the Ptolemy project at the University of California at Berkeley.

The initial CAL language specification has been released in December 2003. A ISO standard version of CAL has been released in Oct 2009.

CAL is a textual language that is used to define the functionality of dataflow components called “actors”.

CAL is implemented in at least three publicly available modeling and simulation environments, Ptolemy II, OpenDF and Orcc.

An actor is a modular component that encapsulates its own state, no other actor has access to it, and nothing other actors can do can modify the state of an actor.

The only interaction between actors is through FIFO channels connecting “output ports” to “input ports,” which they use to send and receive “tokens”.

This strong encapsulation leads to loosely coupled systems with very manageable and controllable actor interfaces.

The modularity of an actor assembly fosters concurrent development, it facilitates maintainability and understandability and makes systems constructed in this way more robust and easier to modify.

Page 21: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

21

CAL - A crash course

actor Sum () A ==> X:

s := 0;

action A: [a] ==> X: [s] do s := s + a; endend

actor Add () A, B ==> X:

action A: [a], B: [b] ==> X: [a + b] endend

actor Merge () A, B ==> X:

action A: [v] ==> X: [v] end action B: [v] ==> X: [v] endend

actor Split () A ==> P, N:

action A: [v] ==> P: [v] guard v >= 0 end

action A: [v] ==> N: [-v] guard v < 0 endend

actor BiasedMerge () A, B ==> X:

CopyA: action A: [v] ==> X: [v] end CopyB: action B: [v] ==> X: [v] end

priority CopyA > CopyB; endend

1. consumption & production of tokens

2. state

3. multiple actions

4. guards5. priorities

Page 22: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Actors: big and small2474, 0, 2482, 3034, 2490, 3026, 2498, 3018,2506, 2978, 2514, 2890, 2522, 2770, 2530, 2714,2538, 2658, 2546, 2634, 2554, 2610, 2562, 2586,2570, 2578, 1, 1, 1, 1, 2594, 2602,1, 1, 1, 1, 2618, 2626, 128, -128,124, -124, 2642, 2650, 120, -120, 116, -116,2666, 2690, 2674, 2682, 112, -112, 108, -108,2698, 2706, 104, -104, 100, -100, 2722, 2746,2730, 2738, 96, -96, 92, -92, 2754, 2762,88, -88, 84, -84, 2778, 2834, 2786, 2810,2794, 2802, 80, -80, 76, -76, 2818, 2826,72, -72, 68, -68, 2842, 2866, 2850, 2858,64, -64, 60, -60, 2874, 2882, 56, -56,52, -52, 2898, 2970, 2906, 2946, 2914, 2938,2922, 2930, 48, -48, 44, -44, 40, -40,2954, 2962, 36, -36, 32, -32, 28, -28,2986, 3010, 2994, 3002, 24, -24, 20, -20,16, -16, 12, -12, 8, -8, 4, -4];

// Starting indices into VLD table aboveint MCBPC_IVOP_START_INDEX = 0;int MCBPC_PVOP_START_INDEX = 16;int CBPY_START_INDEX = 58;int DCBITS_Y_START_INDEX = 92;int DCBITS_UV_START_INDEX = 118;int COEFF_INTER_START_INDEX = 144;int COEFF_INTRA_START_INDEX = 380;int MV_START_INDEX = 616;

// VLD decode engine.int( size=VLD_TABLE_ADDR_BITS ) vld_index;int( size=VLD_TABLE_DATA_BITS ) vld_codeword := 1;procedure start_vld_engine( int index )begin

vld_index := index;vld_codeword := 2;

endfunction vld_success() --> bool : bitand(vld_codeword,3) = 0 endfunction vld_failure() --> bool : bitand(vld_codeword,1) = 1 endfunction vld_result() --> int( size=VLD_TABLE_DATA_BITS ) : rshift(vld_codeword,2) endaction bits:[ b ] ==>guard

bitand(vld_codeword,3) = 2 do

vld_codeword := vld_table[ vld_index + if b then 1 else 0 end ];vld_index := rshift(vld_codeword,2);bit_count := bit_count + 1;

endvld_failure: action ==>guard

vld_failure()// do// println("Bad VLD codeword");

end

generic_done: action ==>guard

done_reading_bits()end

generic_done_with_bitread: action bits:[b] ==>guard

done_reading_bits()do

bit_count := bit_count + 1;end

// stuck: action ==>// do// set_bits_to_read(0);// reset_texture();

// println("Stuck at bit_count " + bit_count);// end

test_zero_byte: action ==>guard

done_reading_bits(),bitand( read_result(), 255 ) = 0

end

test_one_byte: action ==>guard

done_reading_bits(),bitand( read_result(), 255 ) = 1

end

request_byte: action ==>guard

done_reading_bits()do

set_bits_to_read( 8 );end

request_vol: action ==>do

set_bits_to_read( VOL_START_CODE_LENGTH );end

schedule fsm stuck_1a /* look_for_vo */ :

// Process a new VOLlook_for_vo ( look_for_vo ) --> vo_header;vo_header ( generic_done ) --> stuck;vo_header ( vo_header.good ) --> look_for_vol;

// Process a new VOLlook_for_vol ( generic_done ) --> stuck;look_for_vol ( vol_startcode.good ) --> vol_object;vol_object ( vol_object_layer_identification ) --> vol_aspect;vol_aspect ( vol_aspect.detailed ) --> vol_control;vol_aspect ( generic_done ) --> vol_control;vol_control ( vol_control.detailed ) --> vol_vbv;vol_control ( generic_done_with_bitread ) --> vol_shape;vol_vbv ( vol_vbv.detailed ) --> vol_shape;vol_vbv ( generic_done_with_bitread ) --> vol_shape;vol_shape ( vol_shape ) --> vol_time_inc_res;vol_time_inc_res ( vol_time_inc_res ) --> vol_width;vol_width ( vol_width ) --> vol_height;vol_height ( vol_height ) --> vol_misc;// vol_misc ( vol_misc.unsupported ) --> stuck;vol_misc ( generic_done ) --> look_for_vop;

// Process a new VOPlook_for_vop ( byte_align ) --> get_vop_code;get_vop_code ( get_vop_code ) --> vop_code;vop_code ( vo_header.good ) --> look_for_vol; // DBP: permit concatenation of sequences without proper end codevop_code ( vop_code.done ) --> look_for_vo; // DBP: start from the beginningvop_code ( vop_code.start ) --> vop_predict;vop_code ( generic_done ) --> stuck;vop_predict ( vop_predict /* .supported */ ) --> vop_timebase;// vop_predict ( generic_done ) --> stuck;vop_timebase ( vop_timebase.one ) --> vop_timebase;vop_timebase ( vop_timebase.zero ) --> vop_coding;vop_coding ( vop_coding.uncoded ) --> look_for_vop;

vop_coding ( vop_coding.coded ) --> send_new_vop_info;send_new_vop_info ( send_new_vop_cmd ) --> send_new_vop_width;send_new_vop_width ( send_new_vop_width ) --> send_new_vop_height;send_new_vop_height( send_new_vop_height ) --> mb;

// Start MBmb ( mb_done ) --> look_for_vop;

mb ( mcbpc_pvop_uncoded ) --> pvop_uncoded1;pvop_uncoded1 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded2;pvop_uncoded2 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded3;pvop_uncoded3 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded4;pvop_uncoded4 ( mcbpc_pvop_uncoded1 ) --> pvop_uncoded5;pvop_uncoded5 ( mcbpc_pvop_uncoded1 ) --> mb;

mb ( get_mcbpc ) --> get_mbtype;get_mbtype ( vld_failure ) --> stuck;get_mbtype ( get_mbtype ) --> final_cbpy;final_cbpy ( vld_failure ) --> stuck;final_cbpy ( final_cbpy_intra ) --> block;final_cbpy ( final_cbpy_inter ) --> mv;

// process all blocks in an MBblock ( mb_dispatch_done ) --> mb;block ( mb_dispatch_intra ) --> texture;block ( mb_dispatch_inter_ac_coded ) --> texture;block ( mb_dispatch_inter_no_ac ) --> block;

// Start texturetexture ( vld_start_intra ) --> get_dc_bits;texture ( vld_start_inter ) --> texac;get_dc_bits ( get_dc_bits.none ) --> texac;get_dc_bits ( get_dc_bits.some ) --> get_dc;get_dc_bits ( vld_failure ) --> stuck;get_dc ( dc_bits_shift ) --> get_dc_a;get_dc_a ( get_dc ) --> texac;texac ( block_done ) --> block;texac ( dct_coeff ) --> vld1;vld1 ( vld_code ) --> texac;vld1 ( vld_level ) --> vld4;vld1 ( vld_run_or_direct ) --> vld7;vld7 ( vld_run ) --> vld6;vld7 ( vld_direct_read ) --> vld_direct;vld1 ( vld_failure ) --> stuck;vld_direct ( vld_direct ) --> texac;vld4 ( do_level_lookup ) --> vld4a;vld4 ( vld_failure ) --> stuck;vld4a ( vld_level_lookup ) --> texac;vld6 ( do_run_lookup ) --> vld6a;vld6 ( vld_failure ) --> stuck;vld6a ( vld_run_lookup ) --> texac;

// mv()mv ( mvcode_done ) --> block;mv ( mvcode ) --> mag_x;mag_x ( vld_failure ) --> stuck;mag_x ( mag_x ) --> get_residual_x; get_residual_x ( get_residual_x ) --> mv_y;mv_y ( mvcode ) --> mag_y;mag_y ( vld_failure ) --> stuck;mag_y ( mag_y ) --> get_residual_y;get_residual_y ( get_residual_y ) --> mv;

// stuck( stuck ) --> stuck_for_good;

// DBP: add minimal error resilience.// byte align, then look for a VO header starting on any byte boundary.// Can't handle bit insertion or deletion, obviously. The VO header// is hex 00000100.stuck ( byte_align ) --> stuck_1a;

stuck_1a ( request_byte ) --> stuck_1b;stuck_1b ( test_zero_byte ) --> stuck_2a;stuck_1b ( generic_done ) --> stuck_1a;

stuck_2a ( request_byte ) --> stuck_2b;stuck_2b ( test_zero_byte ) --> stuck_3a;stuck_2b ( generic_done ) --> stuck_1a;

stuck_3a ( request_byte ) --> stuck_3b;stuck_3b ( test_zero_byte ) --> stuck_3a;stuck_3b ( test_one_byte ) --> stuck_4a;stuck_3b ( generic_done ) --> stuck_1a;

stuck_4a ( request_byte ) --> stuck_4b;stuck_4b ( test_zero_byte ) --> request_vol; // Found a good VO header (Assumes VO ID field=0)stuck_4b ( generic_done ) --> stuck_1a;

request_vol ( request_vol ) --> look_for_vol;

end

priority

vo_header.good > generic_done;vol_aspect.detailed > generic_done;vol_control.detailed > generic_done_with_bitread;vol_vbv.detailed > generic_done_with_bitread;// vol_misc.unsupported > generic_done;

vop_code.done > generic_done;vop_code.start > generic_done;// vop_predict.supported > generic_done;vop_timebase.one > vop_timebase.zero;vop_coding.uncoded > vop_coding.coded;

mb_done > get_mcbpc;mb_done > mcbpc_pvop_uncoded;get_mcbpc.pvop > mcbpc_pvop_uncoded;

get_mbtype.noac > get_mbtype.ac;final_cbpy_inter > final_cbpy_intra;mb_dispatch_done > mb_dispatch_intra >mb_dispatch_inter_no_ac > mb_dispatch_inter_ac_coded;

vld_start_intra > vld_start_inter;get_dc_bits.none > get_dc_bits.some;block_done > dct_coeff;vld_code > vld_level > vld_run_or_direct;vld_run > vld_direct_read;

vld_start_inter.ac_coded > vld_start_inter.not_ac_coded;

mvcode_done > mvcode;

test_zero_byte > generic_done;test_one_byte > generic_done;

end

end

/* BEGINCOPYRIGHT X

Copyright (c) 2004-2006, Xilinx Inc.All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:- Redistributions of source code must retain the above

copyright notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ENDCOPYRIGHT*/// ParseHeaders.cal//// Author: David B. Parlour ([email protected])//

import all caltrop.lib.BitOps;

// BTYPE Output// A single stream of control tokens is sent on the BTYPE output to distribute// VOP and block type information. A minimum of 12 bits is required for this token,// which also sends the VOP width and height parameters. The relevant parameters are:

// MB_COORD_SZ - size of any variable that contains a macroblock coordinate// BTYPE_SZ - size of the BTYPE output (min. 12)

// Bits to signal token type// NEWVOP 2048// INTRA 1024// INTER 512

// If the NEWVOP bits is set, the following field can be masked off:// QUANT_MASK 31// ROUND_TYPE 32// FCODE_MASK 448// FCODE_SHIFT 6

// The next two tokens after NEWVOP are the width and height in macroblocks

// The INTER type also has// ACPRED 1// ACCODED 2// FOURMV 4// MOTION 8

// The INTRA type also has ACPRED

actor ParseHeaders (// Maximum image width (in units of macroblocks) that the decoder can handle.// It is used to allocate line buffer space at compile time.int MAXW_IN_MB,

int MB_COORD_SZ,int BTYPE_SZ,int MV_SZ,int NEWVOP,int INTRA,int INTER,int QUANT_MASK,int ROUND_TYPE,int FCODE_MASK,int FCODE_SHIFT,int ACPRED,int ACCODED,int FOURMV,int MOTION,int SAMPLE_COUNT_SZ,int SAMPLE_SZ

) bool bits ==> int(size=BTYPE_SZ) BTYPE, int(size=MV_SZ) MV, int(size=SAMPLE_COUNT_SZ) RUN, int(size=SAMPLE_SZ) VALUE, bool LAST :

// Constants for various field lengths (bits) and special values.

int VO_HEADER_LENGTH = 27;int VO_NO_SHORT_HEADER = 8;int VO_ID_LENGTH = 5;

int VOL_START_CODE_LENGTH = 28;int VOL_START_CODE = 18;int VOL_ID_LENGTH = 5;int VIDEO_OBJECT_TYPE_INDICATION_LENGTH = 8;int VISUAL_OBJECT_LAYER_VERID_LENGTH = 4;int VISUAL_OBJECT_LAYER_PRIORITY_LENGTH = 3;int ASPECT_RATIO_INFO_LENGTH = 4;int ASPECT_RATIO_INFO_IS_DETAILED = 15;int PAR_WIDTH_LENGTH = 8;int PAR_HEIGHT_LENGTH = 8;int CHROMA_FORMAT_LENGTH = 2;int LOW_DELAY_LENGTH = 1;int FIRST_HALF_BIT_RATE_LENGTH = 15;int LAST_HALF_BIT_RATE_LENGTH = 15;int FIRST_HALF_VBV_BUF_SZ_LENGTH = 15;int LAST_HALF_VBV_BUF_SZ_LENGTH = 3;int FIRST_HALF_VBV_OCC_LENGTH = 11;int LAST_HALF_VBV_OCC_LENGTH = 15;int VOL_SHAPE_LENGTH = 2;int MARKER_LENGTH = 1;int TIME_INC_RES_LENGTH = 16;int VOL_WIDTH_LENGTH = 13;int VOL_HEIGHT_LENGTH = 13;

//int TIME_INC_RES_MASK = lshift( 1, TIME_INC_RES_LENGTH ) - 1;// Note: mask off width, height in units of MBs, not pixels//int VOL_WIDTH_MASK = lshift( 1, VOL_WIDTH_LENGTH - 4 ) - 1;//int VOL_HEIGHT_MASK = lshift( 1, VOL_HEIGHT_LENGTH - 4 ) - 1;//int ASPECT_RATIO_INFO_MASK = lshift( 1, ASPECT_RATIO_INFO_LENGTH ) - 1;

int RUN_LENGTH = 6;int RUN_MASK = lshift( 1, RUN_LENGTH ) - 1;int LEVEL_LENGTH = 12;int LEVEL_MASK = lshift( 1, LEVEL_LENGTH ) - 1;

// count of misc VOL bits:// interlaced OBMC_disabled vol_sprite_usage not_8_bit// vol_quant_type complexity_est_disable resync_marker_disable data_partitioning_enable// scalability// Note: We assume data_partitioning_enable is always false, so the model// does not do a conditional read of reversible_vlc_enableint MISC_BIT_LENGTH = 9;

int VOP_START_CODE_LENGTH = 32;int VOP_START_CODE = 438;int VOP_PREDICTION_LENGTH = 2;int B_VOP = 2;int P_VOP = 1;int I_VOP = 0;int INTRA_DC_VLC_THR_LENGTH = 3;int VOP_FCODE_FOR_LENGTH = 3;int VOP_FCODE_FOR_MASK = lshift( 1, VOP_FCODE_FOR_LENGTH ) - 1;int BITS_QUANT = 5;int BITS_QUANT_MASK = lshift( 1, BITS_QUANT ) - 1;int MCBPC_LENGTH = 9;int ESCAPE_CODE = 7167;

function mask_bits( v, n ) --> int :bitand( v, lshift(1,n)-1 )

end

// Utility action to read a specified number of bits.// This is an unnamed action, ie it is always enabled and has highest priority.// Use the procedures set_bits_to_read() to start reading, test for// completion with the boolean done_reading_bits() and get the value// with read_result(). Use the done function in a guard to wait for the// reading to be complete.int(size=7) bits_to_read_count := -1;int(size=33) read_result_in_progress;procedure set_bits_to_read( int count )begin

bits_to_read_count := count - 1;read_result_in_progress := 0;

endfunction done_reading_bits() --> bool : bits_to_read_count < 0 endfunction read_result() --> int : read_result_in_progress endaction bits:[ b ] ==>guard

not done_reading_bits()doread_result_in_progress := bitor( lshift( read_result_in_progress, 1), if b then 1 else 0 end );bits_to_read_count := bits_to_read_count - 1;bit_count := bit_count + 1;

end

int(size=4) bit_count := 0;

/********************************************************************************************************************************** start VOL **********************************************************************************************************************************/

// DBP: modified to support only VO_NO_SHORT_HEADERlook_for_vo: action ==>do

set_bits_to_read( VO_HEADER_LENGTH + VO_ID_LENGTH );end

// We can only handle VOL without short headervo_header.good: action ==>guard

done_reading_bits(),mask_bits( rshift( read_result(), VO_ID_LENGTH ), VO_HEADER_LENGTH ) = VO_NO_SHORT_HEADER

doset_bits_to_read( VOL_START_CODE_LENGTH );

end

vol_startcode.good: action ==>guard

done_reading_bits(),mask_bits( read_result(), VOL_START_CODE_LENGTH ) = VOL_START_CODE

do// Ignore the next two fieldsset_bits_to_read( VOL_ID_LENGTH + VIDEO_OBJECT_TYPE_INDICATION_LENGTH );

end

vol_object_layer_identification: action bits:[b] ==>guard

done_reading_bits()do

set_bits_to_read(if b then

// is_object_layer_identifier assertedVISUAL_OBJECT_LAYER_VERID_LENGTH + VISUAL_OBJECT_LAYER_PRIORITY_LENGTH + ASPECT_RATIO_INFO_LENGTH

elseASPECT_RATIO_INFO_LENGTH

end );bit_count := bit_count + 1;

end

vol_aspect.detailed: action ==>guard

done_reading_bits(),mask_bits( read_result(), ASPECT_RATIO_INFO_LENGTH ) = ASPECT_RATIO_INFO_IS_DETAILED

do// Skip over aspect ratio detailsset_bits_to_read( PAR_WIDTH_LENGTH + PAR_HEIGHT_LENGTH );

end

vol_control.detailed: action bits:[b] ==>guard

done_reading_bits(),b

doset_bits_to_read( CHROMA_FORMAT_LENGTH + LOW_DELAY_LENGTH );bit_count := bit_count + 1;

end

vol_vbv.detailed: action bits:[b] ==>guard

done_reading_bits(),b

doset_bits_to_read( FIRST_HALF_BIT_RATE_LENGTH + MARKER_LENGTH +

LAST_HALF_BIT_RATE_LENGTH + MARKER_LENGTH +FIRST_HALF_VBV_BUF_SZ_LENGTH + MARKER_LENGTH +LAST_HALF_VBV_BUF_SZ_LENGTH +FIRST_HALF_VBV_OCC_LENGTH + MARKER_LENGTH +LAST_HALF_VBV_OCC_LENGTH + MARKER_LENGTH );

bit_count := bit_count + 1;end

vol_shape: action ==>guard

done_reading_bits()do

set_bits_to_read( VOL_SHAPE_LENGTH + MARKER_LENGTH + TIME_INC_RES_LENGTH + MARKER_LENGTH + 1 );end

int(size=7) mylog;

vol_time_inc_res: action ==>guard

done_reading_bits()var

int(size=TIME_INC_RES_LENGTH+1) time_inc_res := mask_bits( rshift( read_result(), 2 ), TIME_INC_RES_LENGTH ), // pick out the time_inc_res fieldint(size=7) count := 0,int(size=7) ones := 0

dowhile ( count = 0 or time_inc_res != 0 ) do

if bitand( time_inc_res, 1 ) = 1 thenones := ones + 1;

endcount := count + 1;time_inc_res := rshift( time_inc_res, 1 );

endmylog := if ones > 1 then count else count - 1 end;mylog := if mylog < 1 then 1 else mylog end;set_bits_to_read(if bitand( read_result(), 1 ) = 1 then

// fixed vop rate//mylog + MARKER_LENGTH mylog

else// variable vop rate//MARKER_LENGTH0

end + MARKER_LENGTH + VOL_WIDTH_LENGTH + MARKER_LENGTH );end

// The vol width and height in units of macroblocks, ie. the pixel w/h divided by 16.// Note: there is no provision in the model for pixel sizes that are no multiples of 16.int(size=MB_COORD_SZ) vol_width;int(size=MB_COORD_SZ) vol_height;

vol_width: action ==>guard

done_reading_bits()do

vol_width := mask_bits( rshift( read_result(), MARKER_LENGTH + 4 ), VOL_WIDTH_LENGTH-4 ); // strip marker and divide by 16set_bits_to_read( VOL_HEIGHT_LENGTH + MARKER_LENGTH );

end

vol_height: action ==>guard

done_reading_bits()do

vol_height := mask_bits( rshift( read_result(), MARKER_LENGTH + 4 ), VOL_HEIGHT_LENGTH ); // strip marker and divide by 16set_bits_to_read( MISC_BIT_LENGTH );

end

// Detect unsupported features:// interlaced, sprites, not 8-bit pixels, not using method 2 quantization, data partitioning, scalability// vol_misc.unsupported: action ==>// guard// done_reading_bits(),

// test mask is 9-bit binary 101110011// bitand( read_result(), 371 ) != 0 // end

/********************************************************************************************************************************** start VOP **********************************************************************************************************************************/

byte_align: action ==>do

set_bits_to_read( 8 - bitand( bit_count, 7 ) );end

// Note: no check for correct bit stuffing

get_vop_code: action ==>guard

done_reading_bits()do

set_bits_to_read( VOP_START_CODE_LENGTH );end

// Note: the model does not support user data herevop_code.done: action ==>guard

done_reading_bits(),read_result() = 1

dobit_count := 0;

end

int(size=MB_COORD_SZ) mbx;int(size=MB_COORD_SZ) mby;

vop_code.start: action ==>guard

done_reading_bits(),read_result() = VOP_START_CODE

dombx := 0;mby := 0;set_bits_to_read( VOP_PREDICTION_LENGTH );

end

// int(size=2) prediction_type;

// vop_predict.supported: action ==>// guard// done_reading_bits(),// mask_bits( read_result(), VOP_PREDICTION_LENGTH ) = I_VOP or// mask_bits( read_result(), VOP_PREDICTION_LENGTH ) = P_VOP// do// prediction_type := read_result();// end

bool prediction_is_IVOP;

vop_predict: action ==>guard

done_reading_bits()do

prediction_is_IVOP := mask_bits( read_result(), VOP_PREDICTION_LENGTH ) = I_VOP;end

// Note: the model does not support time_base. It just skips over the right number of bits.vop_timebase.one: action bits:[b] ==>guard

bdo

// Skip over module-time_basebit_count := bit_count + 1;

end

vop_timebase.zero:action bits:[b] ==>do

bit_count := bit_count + 1;set_bits_to_read( MARKER_LENGTH + mylog + MARKER_LENGTH );

end

int(size=4) comp;

// TODO: the model does not communicate to the display driver// to re-use the current VOP in place of the uncoded one.vop_coding.uncoded: action bits:[b] ==>guard

done_reading_bits(),not b

docomp := 0;bit_count := bit_count + 1;

end

vop_coding.coded: action bits:[b] ==>guard

done_reading_bits()do

set_bits_to_read(if not prediction_is_IVOP then

// round_type, intra_dc_vlc_thr[3], vop_quant[5], vop_fcode_for[3]1 + INTRA_DC_VLC_THR_LENGTH + BITS_QUANT + VOP_FCODE_FOR_LENGTH

else// intra_dc_vlc_thr[3], vop_quant[5]INTRA_DC_VLC_THR_LENGTH + BITS_QUANT

end );bit_count := bit_count + 1;

end

// int(size=10) resync_marker_length;int(size=VOP_FCODE_FOR_LENGTH+1) fcode;

send_new_vop_cmd: action ==> BTYPE:[ cmd ]guard

done_reading_bits()var

bool round := false,int(size=BTYPE_SZ) cmd := bitor( NEWVOP, if prediction_is_IVOP then INTRA else INTER end ),int(size=BITS_QUANT+1) vop_quant

doif not prediction_is_IVOP then

round := bitand( rshift( read_result(), INTRA_DC_VLC_THR_LENGTH + BITS_QUANT + VOP_FCODE_FOR_LENGTH ), 1 ) = 1;vop_quant := bitand( rshift( read_result(), VOP_FCODE_FOR_LENGTH ), BITS_QUANT_MASK );fcode := bitand( read_result(), VOP_FCODE_FOR_MASK );// resync_marker_length := 16 + fcode;

elsevop_quant := bitand( read_result(), BITS_QUANT_MASK );fcode := 0;// resync_marker_length := 17;

endcmd := bitor( cmd, bitand(vop_quant,QUANT_MASK) );cmd := bitor( cmd, if round then ROUND_TYPE else 0 end );cmd := bitor( cmd, bitand( lshift( fcode, FCODE_SHIFT), FCODE_MASK) );

end

send_new_vop_width: action ==> BTYPE: [ vol_width ] end

send_new_vop_height: action ==> BTYPE: [ vol_height ] end

/********************************************************************************************************************************** start MB **********************************************************************************************************************************/

int CBP_SZ = 7;

int(size=CBP_SZ) cbp;

// Advance the mb coordinates with wrapprocedure next_mbxy()begin

mbx := mbx + 1;if mbx = vol_width then

mbx := 0;mby := mby + 1;

endend

// Go look for next VOPmb_done: action ==>guard

mby = vol_heightend;

get_mcbpc.ivop: action ==>guard

prediction_is_IVOPdo

start_vld_engine( MCBPC_IVOP_START_INDEX );end

get_mcbpc.pvop: action bits:[b] ==>guard

not prediction_is_IVOP,not b

dostart_vld_engine( MCBPC_PVOP_START_INDEX );bit_count := bit_count + 1;

end

// Nothing to do - uncodedmcbpc_pvop_uncoded: action bits:[b] ==>

BTYPE:[ INTER ]guard

not prediction_is_IVOPdo

next_mbxy();bit_count := bit_count + 1;

end

mcbpc_pvop_uncoded1: action ==>BTYPE:[ INTER ]

end

bool acpredflag;bool btype_is_INTRA;int(size=CBP_SZ) cbpc;bool fourmvflag;

get_mbtype.noac: action ==>guard

vld_success(),type != 3,type != 4

varint mcbpc = vld_result(),int type = bitand( mcbpc, 7 )

dobtype_is_INTRA := type >= 3;fourmvflag := (type = 2);cbpc := bitand( rshift( mcbpc, 4 ), 3 );acpredflag := false;start_vld_engine( CBPY_START_INDEX );

end

get_mbtype.ac: action bits:[b] ==> guard

vld_success()var

int mcbpc = vld_result(),int type = bitand( mcbpc, 7 )

dobtype_is_INTRA := true;cbpc := bitand( rshift( mcbpc, 4 ), 3 );acpredflag := b;bit_count := bit_count + 1;start_vld_engine( CBPY_START_INDEX );

end

bool ac_coded;

int(size=4) mvcomp;

final_cbpy_inter: action ==>guard

vld_success(),not btype_is_INTRA

varint cbpy = 15 - vld_result()

docomp := 0;mvcomp := 0;cbp := bitor( lshift( cbpy, 2), cbpc );

end

final_cbpy_intra: action ==>guard

vld_success()var

int cbpy = vld_result()do

comp := 0;cbp := bitor( lshift(cbpy, 2), cbpc );

end

mb_dispatch_done: action ==>guard

comp = 6do

next_mbxy();end

mb_dispatch_intra: action ==> BTYPE:[ cmd ]guard

btype_is_INTRAvar

int(size=BTYPE_SZ) cmd := INTRAdo

ac_coded := bitand( cbp, lshift( 1, 5 - comp)) != 0;cmd := bitor( cmd, if ac_coded then ACCODED else 0 end );cmd := bitor( cmd, if acpredflag then ACPRED else 0 end );

end

mb_dispatch_inter_no_ac: action ==> BTYPE:[ bitor( INTER, bitor( MOTION, if fourmvflag then FOURMV else 0 end ) ) ]guard

bitand( cbp, lshift( 1, 5 - comp)) = 0do

ac_coded := false;comp := comp + 1;

end

mb_dispatch_inter_ac_coded: action ==> BTYPE:[ bitor(bitor( INTER, ACCODED ), bitor( MOTION, if fourmvflag then FOURMV else 0 end ) ) ]do

ac_coded := true;end

vld_start_intra: action ==>guard

btype_is_INTRAdo

start_vld_engine( if comp < 4 then DCBITS_Y_START_INDEX else DCBITS_UV_START_INDEX end );b_last := false;

end

// There are AC coefficientsvld_start_inter.ac_coded: action ==>guard

ac_codeddo

b_last := false;end

// No AC coefficientsvld_start_inter.not_ac_coded: action ==> RUN:[0], VALUE:[0], LAST:[true]do

b_last := true;end

// The Y DC value is at most 12 bits, UV at most 13 bitsint(size=5) dc_bits;

get_dc_bits.none: action ==> RUN:[0], VALUE:[0], LAST:[ not ac_coded ]guard

vld_success(),vld_result() = 0

dob_last := not ac_coded;

end

get_dc_bits.some: action ==>guard

vld_success()do

dc_bits := vld_result();set_bits_to_read( dc_bits );

end

int(size=14) msb_result;

dc_bits_shift: action ==>var

int(size=5) count = dc_bits,int(size=14) shift = 1

dowhile count > 1 do

shift := lshift( shift, 1 );count := count - 1;

endmsb_result := shift;

end

get_dc: action ==> RUN:[0], VALUE:[v], LAST:[ not ac_coded ]guard

done_reading_bits()var

int(size=14) v = read_result()do

if bitand( v, msb_result ) = 0 thenv := v + 1 - lshift( msb_result, 1 );

end set_bits_to_read( if dc_bits > 8 then MARKER_LENGTH else 0 end );b_last := not ac_coded;

end

bool b_last;

block_done: action ==>guard

done_reading_bits(),b_last

docomp := comp + 1;

end

dct_coeff: action ==>guard

done_reading_bits()do

start_vld_engine( if btype_is_INTRA then COEFF_INTRA_START_INDEX else COEFF_INTER_START_INDEX end );end

vld_code: action bits:[sign] ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]guard

vld_success(),val != ESCAPE_CODE

varint(size=VLD_TABLE_DATA_BITS) val = vld_result(),int(size=SAMPLE_COUNT_SZ) run,int(size=SAMPLE_SZ) level,bool last

dorun := if btype_is_INTRA then bitand( rshift( val, 8), 255)

else bitand( rshift( val, 4), 255) end;last := if btype_is_INTRA then bitand( rshift( val, 16), 1) != 0

else bitand( rshift( val, 12), 1) != 0 end;level := if btype_is_INTRA then bitand( val, 255)

else bitand( val, 15) end;b_last := last;bit_count := bit_count + 1;

end

vld_level: action bits:[level_offset] ==>guard

vld_success(),not level_offset

dobit_count := bit_count + 1;start_vld_engine( if btype_is_INTRA then COEFF_INTRA_START_INDEX else COEFF_INTER_START_INDEX end );

end

vld_run_or_direct: action bits:[level_offset] ==>guard

vld_success()do

bit_count := bit_count + 1;end

vld_run: action bits:[run_offset] ==>guard

not run_offsetdo

bit_count := bit_count + 1;start_vld_engine( if btype_is_INTRA then COEFF_INTRA_START_INDEX else COEFF_INTER_START_INDEX end );

end

vld_direct_read: action bits:[run_offset] ==>do

bit_count := bit_count + 1;set_bits_to_read( 1 + RUN_LENGTH + MARKER_LENGTH + LEVEL_LENGTH + MARKER_LENGTH );

end

vld_direct: action ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]guard

done_reading_bits()var

int(size=SAMPLE_COUNT_SZ) run,int(size=SAMPLE_SZ) level,bool sign,bool last

dolast := bitand( rshift( read_result(), RUN_LENGTH + MARKER_LENGTH + LEVEL_LENGTH + MARKER_LENGTH ), 1 ) != 0;run := bitand( rshift( read_result(), MARKER_LENGTH + LEVEL_LENGTH + MARKER_LENGTH ), RUN_MASK );level := bitand( rshift( read_result(), MARKER_LENGTH ), LEVEL_MASK );if level >= 2048 then

sign := true;level := 4096 - level;

elsesign := false;

endb_last := last;

end

function intra_max_level( bool last, int run) :if not last then

if run = 0 then 27 elseif run = 1 then 10 else

if run = 2 then 5 elseif run = 3 then 4 else

if run < 8 then 3 elseif run < 10 then 2 else

if run < 15 then 1 else 0 endend

endend

endend

endelse

if run = 0 then 8 elseif run = 1 then 3 else

if run < 7 then 2 elseif run < 21 then 1 else 0 end

endend

endend

end

function inter_max_level( bool last, int run) :if not last then

if run = 0 then 12 elseif run = 1 then 6 else

if run = 2 then 4 elseif run < 7 then 3 else

if run < 11 then 2 elseif run < 27 then 1 else 0 end

endend

endend

endelse

if run = 0 then 3 elseif run = 1 then 2 else

if run < 41 then 1 else 0 endend

endend

end

int(size=SAMPLE_SZ) level_lookup_inter;int(size=SAMPLE_SZ) level_lookup_intra;

do_level_lookup: action ==>guardvld_success()

varint(size=VLD_TABLE_DATA_BITS-2) val = vld_result()

dolevel_lookup_inter := inter_max_level( bitand( rshift( val, 12), 1) != 0, bitand( rshift( val, 4), 255));level_lookup_intra := intra_max_level( bitand( rshift( val, 16), 1) != 0, bitand( rshift( val, 8), 255));

end

vld_level_lookup: action bits:[sign] ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]var

int(size=VLD_TABLE_DATA_BITS-2) val = vld_result(),int(size=SAMPLE_COUNT_SZ) run,int(size=SAMPLE_SZ) level,bool last

dorun := if btype_is_INTRA then bitand( rshift( val, 8), 255)

else bitand( rshift( val, 4), 255) end;last := if btype_is_INTRA then bitand( rshift( val, 16), 1) != 0

else bitand( rshift( val, 12), 1) != 0 end;level := if btype_is_INTRA then bitand( val, 255) + level_lookup_intra

else bitand( val, 15) + level_lookup_inter end;b_last := last;bit_count := bit_count + 1;

end

function intra_max_run( bool last, int level) :if not last then

if level = 1 then 14 elseif level = 2 then 9 else

if level = 3 then 7 elseif level = 4 then 3 else

if level = 5 then 2 elseif level < 11 then 1 else 0 end

endend

endend

endelseif level = 1 then 20 else

if level = 2 then 6 elseif level = 3 then 1 else 0 end

endend

endend

function inter_max_run( bool last, int level) :if not last then

if level = 1 then 26 elseif level = 2 then 10 else

if level = 3 then 6 elseif level = 4 then 2 else

if level = 5 or level = 6 then 1 else 0 endend

endend

endelse

if level = 1 then 40 elseif level = 2 then 1 else 0 end

endend

end

int(size=SAMPLE_SZ) run_lookup_inter;int(size=SAMPLE_SZ) run_lookup_intra;

// Do lookup both ways do_run_lookup: action ==>guard

vld_success()var

int(size=VLD_TABLE_DATA_BITS-2) val = vld_result()do

run_lookup_inter := inter_max_run( bitand( rshift( val, 12), 1) != 0,bitand( val, 15) );

run_lookup_intra := intra_max_run( bitand( rshift( val, 16), 1) != 0, bitand( val, 255) );

end

vld_run_lookup: action bits:[sign] ==> VALUE:[if sign then -level else level end], RUN:[run], LAST:[ last ]var

int(size=VLD_TABLE_DATA_BITS-2) val = vld_result(),int(size=SAMPLE_COUNT_SZ) run,int(size=SAMPLE_SZ) level,bool last

dolast := if btype_is_INTRA then bitand( rshift( val, 16), 1) != 0

else bitand( rshift( val, 12), 1) != 0 end; level := if btype_is_INTRA then bitand( val, 255)

else bitand( val, 15) end; run := if btype_is_INTRA then bitand( rshift( val, 8), 255) + run_lookup_intra

else bitand( rshift( val, 4), 255) + run_lookup_inter end + 1; b_last := last;bit_count := bit_count + 1;

end

/********************************************************************************************************************************** Motion Decode **********************************************************************************************************************************/

mvcode_done: action ==>guard

mvcomp = 4 or (mvcomp = 1 and not fourmvflag )end

mvcode: action ==>do

start_vld_engine( MV_START_INDEX );end

mag_x: action ==> MV:[ mvval ]guard

vld_success()var

int (size=VLD_TABLE_DATA_BITS) mvval = vld_result()do

set_bits_to_read( if fcode <= 1 or mvval = 0 then 0 else fcode end );end

get_residual_x: action ==> MV:[ read_result() ]guard

done_reading_bits()end

mag_y: action ==> MV:[ mvval ]guard

vld_success()var

int (size=VLD_TABLE_DATA_BITS) mvval = vld_result()do

set_bits_to_read( if fcode <= 1 or mvval = 0 then 0 else fcode-1 end );end

get_residual_y: action ==> MV:[ read_result() ]guard

done_reading_bits()do

mvcomp := mvcomp + 1;end

int VLD_TABLE_ADDR_BITS = 12;int VLD_TABLE_DATA_BITS = 20;

list( type:int( size=VLD_TABLE_DATA_BITS ), size=760 ) vld_table = [

// Automatically-generated tables for bitwise MPEG-4 VLD//// Decoding proceeds as follows:// 1. Set index to a starting value for the desired code (see embedded comments).// 2. Read the next bit in the incoming stream.// 3. Fetch the value at table[ index + bit ]// 4. Take the following action based on table value:// 4.1. If lsb = 1, terminate decoding with illegal codeword error.// 4.2. If 2nd lsb = 1, the codeword is not complete,// so set index to value >> 2, go to step 2.// 4.3. For all other values stop decoding and return value >> 2

// start index for MCBPC_IVOP is 0// (cumulative table size is 16 words x 20 bits)

10, 12, 18, 58, 26, 76, 34, 16,42, 50, 1, 80, 144, 208, 140, 204,

// start index for MCBPC_PVOP is 16// (cumulative table size is 58 words x 20 bits)

74, 0, 82, 226, 90, 218, 98, 202,106, 178, 114, 162, 122, 146, 130, 138,1, 1, 208, 144, 154, 140, 80, 196,170, 204, 76, 200, 186, 194, 136, 72,132, 68, 210, 12, 16, 192, 128, 64,8, 4,

// start index for CBPY is 58// (cumulative table size is 92 words x 20 bits)

242, 338, 250, 314, 258, 298, 266, 290,274, 282, 1, 1, 24, 36, 32, 16,306, 0, 8, 4, 322, 330, 48, 40,56, 20, 346, 60, 354, 362, 52, 12,44, 28,

// start index for DCBITS_Y is 92// (cumulative table size is 118 words x 20 bits)

378, 466, 386, 458, 394, 16, 402, 20,410, 24, 418, 28, 426, 32, 434, 36,442, 40, 450, 44, 1, 48, 12, 0,8, 4,

// start index for DCBITS_UV is 118// (cumulative table size is 144 words x 20 bits)

482, 570, 490, 8, 498, 12, 506, 16,514, 20, 522, 24, 530, 28, 538, 32,546, 36, 554, 40, 562, 44, 1, 48,4, 0,

// start index for COEFF_INTER is 144// (cumulative table size is 380 words x 20 bits)

586, 1498, 594, 1426, 602, 1338, 610, 1194,618, 1066, 626, 874, 634, 818, 642, 794,650, 770, 658, 714, 666, 690, 674, 682,1, 1, 1, 1, 698, 706, 1, 1,1, 1, 722, 746, 730, 738, 1, 1,1, 1, 754, 762, 1, 1, 1, 1,778, 786, 16456, 16396, 44, 40, 802, 810,18180, 18116, 18052, 17988, 826, 850, 834, 842,584, 520, 456, 392, 858, 866, 328, 204,140, 80, 882, 28668, 890, 946, 898, 922,906, 914, 48, 84, 1476, 1540, 930, 938,18244, 18308, 18372, 18436, 954, 1010, 962, 986,970, 978, 88, 144, 268, 332, 994, 1002,396, 648, 1604, 1668, 1018, 1042, 1026, 1034,18500, 18564, 18628, 18692, 1050, 1058, 18756, 18820,18884, 18948, 1074, 1138, 1082, 1114, 1090, 1106,1098, 17924, 36, 32, 17860, 17796, 1122, 1130,17732, 17668, 17604, 17540, 1146, 1170, 1154, 1162,17476, 16392, 1412, 1348, 1178, 1186, 1284, 1220,1156, 1092, 1202, 1282, 1210, 1258, 1218, 1242,1226, 1234, 1028, 964, 264, 200, 1250, 17412,28, 24, 1266, 1274, 17348, 17284, 17220, 17156,1290, 1314, 1298, 1306, 17092, 17028, 16964, 900,1322, 1330, 836, 136, 76, 20, 1346, 1402,1354, 1378, 1362, 1370, 16900, 16836, 16772, 16708,1386, 1394, 772, 708, 644, 16, 1410, 1418,16644, 16580, 16516, 16452, 1434, 1482, 1442, 1466,1450, 1458, 580, 516, 452, 388, 1474, 324,72, 12, 1490, 16388, 260, 196, 4, 1506,68, 1514, 132, 8,

// start index for COEFF_INTRA is 380// (cumulative table size is 616 words x 20 bits)

1530, 2442, 1538, 2370, 1546, 2282, 1554, 2138,1562, 2010, 1570, 1818, 1578, 1762, 1586, 1738,1594, 1714, 1602, 1658, 1610, 1634, 1618, 1626,1, 1, 1, 1, 1642, 1650, 1, 1,1, 1, 1666, 1690, 1674, 1682, 1, 1,1, 1, 1698, 1706, 1, 1, 1, 1,1722, 1730, 262172, 262168, 88, 84, 1746, 1754,264200, 263180, 262164, 13316, 1770, 1794, 1778, 1786,5132, 8200, 4108, 3088, 1802, 1810, 2064, 1052,80, 76, 1826, 28668, 1834, 1890, 1842, 1866,1850, 1858, 92, 96, 1056, 9224, 1874, 1882,265224, 266248, 277508, 278532, 1898, 1954, 1906, 1930,1914, 1922, 100, 104, 108, 1060, 1938, 1946,6156, 1064, 2068, 7180, 1962, 1986, 1970, 1978,14340, 262176, 267272, 268296, 1994, 2002, 279556, 280580,281604, 282628, 2018, 2082, 2026, 2058, 2034, 2050,2042, 276484, 72, 68, 275460, 274436, 2066, 2074,273412, 272388, 263176, 262160, 2090, 2114, 2098, 2106,12292, 11268, 7176, 6152, 2122, 2130, 5128, 3084,2060, 1048, 2146, 2226, 2154, 2202, 2162, 2186,2170, 2178, 1044, 64, 4104, 60, 2194, 270340,56, 52, 2210, 2218, 269316, 268292, 262156, 10244,2234, 2258, 2242, 2250, 9220, 8196, 271364, 3080,2266, 2274, 1040, 48, 44, 40, 2290, 2346,2298, 2322, 2306, 2314, 266244, 265220, 6148, 267268,2330, 2338, 7172, 2056, 1036, 36, 2354, 2362,262152, 5124, 264196, 263172, 2378, 2426, 2386, 2410,2394, 2402, 4100, 3076, 32, 28, 2418, 2052,1032, 24, 2434, 262148, 20, 16, 4, 2450,8, 2458, 1028, 12,

// start index for MV is 616// (cumulative table size is 760 words x 20 bits)

/* BEGINCOPYRIGHT X

Copyright (c) 2007, Xilinx Inc.All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:- Redistributions of source code must retain the above

copyright notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ENDCOPYRIGHT*/

import all caltrop.lib.BitOps;

actor ForemanSource () int(size=8) MPEG, int(size=8) COMPARE ==> int(size=8) dataout, int(size=32) errors :

int mismatch := 0;int(size=24) count := 0;

action MPEG:[mpeg], COMPARE:[comp] ==> dataout:[ mpeg ]guard

mismatch = 0do

mismatch := bitxor(mpeg, comp);count := count + 1;

end

action ==> errors:[ bitor(lshift(count, 8), bitand(old mismatch, 0xFF)) ]guard

mismatch != 0do

mismatch := 0;end

end

Compare23 lines(without header comments)

ParseHeaders1320 lines(without header comments)

http://opendf.svn.sourceforge.net/viewvc/opendf/models/MPEG4_SP_Decoder/

Page 23: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Actor/Dataflow Programming Model

encapsulated state

Actions

Schedule State

point-to-point, buffered token-passing connections

actors guarded atomic actions

dataflowI/O interfaces

class MyActor{ schedule();readPort( portNum );writePort( portNum );

}

simulation

software

hardware

actor source+ network

high-level synthesis

autonomousschedule

Page 24: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Static scheduling: token rates

actor Add () Input1, Input2 ==> Output:

action [a], [b] ==> [a + b] endend

actor AddSeq () Input ==> Output:

action [a, b] ==> [a + b] endend

2 1

11

1

Page 25: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Static scheduling: token rates

actor Split () Input ==> Pos, Neg:

action [a] ==> Pos: [a]guard a >= 0 end

action [a] ==> Neg: [a]guard a < 0 end

end

actor Multiplex () S, A, B ==> Output:

action S: [sel], A: [v] ==> [v]guard sel end

action S: [sel], B: [v] ==> [v]guard not sel end

end

?1

?

?

1?

1

Page 26: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

26

• SDF: static, fixed token exchange rates, compile-time analysis

• CSDF: extension of SDF with the notionof state, the same compile-timeproperties

• Dynamic extensions: different aspects of dynamicbehavior, run-time changes of the topology or token exchange rates

• DDF: firing rules and firing functionscan be data-dependent, run-time analysis only

Dataflow programs: MoCs

DDF

Dynamic extensions to SDF

CSDF

SDF

IDF

BDF

PSDF

SADF

SPDFBPDF

TPDF

Page 27: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

27

• SDF: static, fixed token exchange rates, compile-time analysis

• CSDF: extension of SDF with the notionof state, the same compile-timeproperties

• Dynamic extensions: different aspects of dynamicbehavior, run-time changes of the topology or token exchange rates

• DDF: firing rules and firing functionscan be data-dependent, run-time analysis only

Dataflow programs: MoCs

DDF

Dynamic extensions to SDF

CSDF

SDF

IDF

BDF

PSDF

SADF

SPDFBPDF

TPDF

ATS

Page 28: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Static scheduling conundrum

Synchronous Dataflow(SDF)

Dataflow

static token rates arbitrary token rates, may change over time, be data dependent

static schedule, static buffer analysis, guaranteed deadlock-free

no static schedule, no hard buffer-bounds at compile time, deadlock undecidable

efficient implementation due to compile-time analysis, low or zero runtime overhead

runtime scheduling, requires runtime decision making

Page 29: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Static scheduling

Synchronous Dataflow

(SDF)

Datafloww/

CALDataflow

static token ratesarbitrary token rates, may change over time, be data dependent

arbitrary token rates, may change over time, be data dependent

static schedule, static buffer analysis, guaranteed deadlock-free

partial static scheduling: statically schedulable when possible

no static schedule, no hard buffer-bounds at compile time, deadlock undecidable

efficient implementation due to compile-time analysis, low or zero runtime overhead

adaptive implementation: use compile-time information, make dynamic choices at runtime

runtime scheduling, requires runtime decision making

Page 30: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Static dataflow scheduling

A

CB

ACABC

112

2 1

1 12

statically schedulable networks desirableefficient software implementationsno deadlocks, yet small buffersreduced synchronization overhead (in hardware).

Real-world dataflow networks are seldom completely statically schedulable.

Use internal actor structure to identify analyzable regions.

Page 31: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

31

State-of-the-art: (dataflow) parallel implementations

high

low

expressiveness facility of developing efficient implementations

analyzability

DDF

TPDF

BPDFSPDFSADFPSDF

IDFBDF

SDFCSDF

Page 32: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

32

State-of-the-art: (dataflow) parallel implementations

high

low

expressiveness facility of developing efficient implementations

analyzability

DDF

TPDF

BPDFSPDFSADFPSDF

IDFBDF

SDFCSDF

Page 33: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

33

State-of-the-art: (dataflow) parallel implementations

high

low

expressiveness facility of developing efficient implementations

analyzability

DDF

TPDF

BPDFSPDFSADFPSDF

IDFBDF

SDFCSDF

Page 34: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Sequential versus Dataflow

Page 35: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

CAL MPEG-4 SP DECODER

Design Unit MPEG-4 SP CAL lines of

code

MPEG-4 Part 7 (Optimized reference SW) lines of C code

MPEG-4 SP (Optimized

Proprietary SW) lines of C code

Parser (2 actors)

1585

AC/DC Reconstruction (3 actors)

468

2D-IDCT (2 actors)

52

Motion Compensation (4 actors)

631

TOTALS 2736 31452 (.src + .h files) 6115 with display control and other tools (.src + .h files)

Page 36: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Dataflow programing framework

Open RVC-CAL Compiler (Orcc)Flexible compilation infrastructureSource-to-source: Hardware and/or software codeMulti-target: GPP (x86, ARM, i7, ….), FPGA, …

.ir

actor.cal

Front-endSource code

Targetcompiler / synthesizer

Back-ends Build script

Intermediaterepresentation

Page 37: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Cal Actor Language: Tool infrastructure

FrontendIntermediate representation simpler than the direct AST

Backends TransformationsClassification, SSA transformation, 3AC, repeat transformation…Interpreter, Profiling analysis (EPFL)Quasi static analysis (Oulu)Code generation: actor instantiationC / C++Verilog (Xilinx)C – Vivado HLSTTA (network + actor in pure LLVM code generation) (IRISA)Promela backend (Abo)Code generation: actor as classLLVM (LLVM + metadata)

Page 38: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Code generation with Orcc

A1

A2B

J

LLVM

JADELLVM JIT compilerA.bc B.bc

decoder.xdf

A2.c B.cGcc

Decoder.exe

A1.c

decoder.c

J

Java Virtual Machine

A.class B.class

HotspotJava JIT

compiler

decoder.javaOrcc

Page 39: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

39

CAL HLS Tool Flow based on Xronos

Xronos HLS Tool flow

AST

IR

OR

CC

IR Transform

ations

Parser

ORCCORCC

XRONOS

OR

CC

IR H

DL

Transformations

OR

CC

IR TO

LIM IR

LIM IR

Optim

izationsA

nd Scheduling

VerilogGenerator

Testbench

Generator

ReportGenerator

OpenForge

Xronos offers a full support of the RVC-CAL programming language

CAL HDL

• Xronos analyzes the source and determines:• Inputs & Outputs• Operations• Data dependencies

• A control/dataflow graph (CDFG) is constructed:• With no timing information• Represents the data flow for the functionality

of the design

• The graph shows data dependencies:• Operations that must complete before others

can begin• e.g (a*b) must be complete before (a*b)+c

• The graph shows control dependencies:• If/else constructs will turn into mux operators• Loops constructs to a set of @always

Page 40: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

40

Xronos HLS – OpenForge Optimizations

Xronos HLS Optimizations

Memory Optimizations:

• Dual port memory conversion, one clock for read or write• Memory ROM reduction, reduces the allocated size of each element• Memory merging, reduces size by grouping memories• Memory resizing, resizes the length of a memory to a power of 2 • Read only field reducing, ROM to constant• ROM replicating, reduces the clock access by replicating the ROM

Control Flow Optimization:

• Loop unrolling, unrolls loops• Pipelining, inserting registers to break the combinatorial path into multiple stages

Page 41: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

41

Xronos HLS – Actors I/Os

Xronos HLS Actor IO

actor ACTOR() int IN ==> int OUT:…

end

Each input and output of an actor is transformed into a double handshake interface

Actor

IN_DATA (31:0)

IN_SEND

IN_ACK

IN_RDY

OUT_DATA (31:0)

OUT_SEND

OUT_ACK

OUT_RDYCLK

RESET

Page 42: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

42

Xronos HLS – Dataflow Application

A

IN_A_DATA

IN_A_SEND

IN_A_ACK

IN_A_RDY

OUT_A_DATA

OUT_A_SEND

OUT_A_ACK

OUT_A_RDY

F_A

IN_DATA

IN_SEND

IN_ACK

IN_RDY

OUT_DATA

OUT_SEND (1:0)

OUT_ACK (1:0)

OUT_RDY(1:0)

Q_B

IN_DATA

IN_SEND

IN_ACK

IN_RDY

OUT_DATA

OUT_SEND

OUT_ACK

OUT_RDY

Q_C

IN_DATA

IN_SEND

IN_ACK

IN_RDY

OUT_DATA

OUT_SEND

OUT_ACK

OUT_RDY

B

IN_B_DATA

IN_B_SEND

IN_B_ACK

IN_B_RDY

OUT_B_DATA

OUT_B_SEND

OUT_B_ACK

OUT_B_RDY

C

IN_C_DATA

IN_C_SEND

IN_C_ACK

IN_C_RDY

OUT_C_DATA

OUT_C_SEND

OUT_C_ACK

OUT_C_RDY

Xronos HLS Network of Actors

• Each input port is connected to a queue.

• If an output port feeds more the one input port a connection with appropriate fanout is inserted.

Page 43: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

43

Xronos HLS – Inside an Actor

a0

GO DONE

R_Counter W_Counter

a0_OUT

a0_IN OUT_DATA

a1

GO DONE

R_Counter W_Counter

a1_IN

IN_DATA

a1_Value OUT

a0_Value

Counter

Scheduler

a0_GO

a1_GOa1_DONE

a0_DONE

IN_SEND

OUT_RDY

OUT_RDY

IN_SEND

OUT_SEND

IN_ACKa1_OUT_SENDa1_OUT_DATA

a1_IN_ACK

a0_OUT_SENDa0_IN_ACK

IN_DATA

Xronos HLS Modules of an Actor

actor CondCounter() int IN ==> int OUT:int Counter := 0;a0: action In:[token] ==> OUT:[Counter]guard token >= 0do

Counter := Counter + 1;end

a1: action In:[token] ==> OUT:[Counter]guard token < 0do

Counter := Counter + 2;end…schedule fsm s0:

s0 (a0) s1;s1 (a1) s2;…

endend

Page 44: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

44

Xronos HLS – Inside the Actor scheduler

Scheduler

a0_GO

a1_GOa1_DONE

a0_DONE

OUT_RDY

IN_SEND

Guard a0 (token => 0)

TokenOutput Available

Result

Guard a1 (token < 0)

TokenOutput Available

Result

Action Scheduler

a0_GO

a1_GO

a0_Guard

a1_Guard

a0_DONEa1_DONER_State W_State

Scheduler State

Value OUT

Xronos HLS Actor Scheduler

IN_DATA

Output Available

Output Available

Page 45: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

45

Xronos HLS – Clock Domains

A

IN_A_DATA

IN_A_SEND

IN_A_ACK

IN_A_RDY

OUT_A_DATA

OUT_A_SEND

OUT_A_ACK

OUT_A_RDY

Q_B

IN_DATA

IN_SEND

IN_ACK

IN_RDY

OUT_DATA

OUT_SEND

OUT_ACK

OUT_RDY

B

IN_B_DATA

IN_B_SEND

IN_B_ACK

IN_B_RDY

OUT_B_DATA

OUT_B_SEND

OUT_B_ACK

OUT_B_RDYCLK CLKA CLKB CLK

CLKF1 CLKF2

Xronos HLS Clock Domains

Multi-Clock Domain Support• The user may attribute (graphically or by a file) a different clock domain for each actor• A Multi-Clock queue is inserted• Power reduction with multiple clock domains and clock gating techniques

Page 46: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

46

Xronos HLS – Testbench Generation

HLS Testbenches

Automatic testbench generation:• Only the input and output tokens, generated by the CAL simulator

or a synthesized C code is needed• Error detection, simulation stops and shows in which token the

error has occurred• VHDL testbench for each actor and network

Input files generated for “Modelsim”:• TCL file for each actor and network that:

- Adds all inputs and outputs of the actor or network (in case of actor simulation or network simulation)

- Adds the “Go” and “Done” signals for each action

Page 47: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

47

Xronos HLS – Testbench Generation

Xronos HLS Testbenches

Network I/O

Actor I/O

Go & Done of Actions

• Comparison with Gold Reference• Error detection

Page 48: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

CAL DATAFLOW THE DESIGN FLOW

CAL DATAFLOW PROGRAMMING(ORCC)

DESIGN SPACE EXPLORATION(TURNUS)

HDL SYNTHESIS (XRONOS)

SW SYNTHESIS (ORCC backends)

Constraints

CAL library

CAL Program

Functional Verification

Design Space Exploration

Interface Synthesis

Architecture Model

Mapping {Partitioning

and Scheduling}

CAL Profiling

Performance Analysis Data and

Refactoring Directions

Refactored CAL Program

Hardware Code

Generation

Software Code

Generation

HDL Library

RTL Synthesis Compilation

RTL Implementation

SW Executable

Runtime Performance

Data

Refactoring Decisions

Iterations

?

Page 49: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

The structure of a computation

actor F () A ==> X:

action A: [a] ==> X: [f(a)] endend

actor Src () ==> X:

s := 0;

action ==> X: [s] do

s := h(s);end

end

Src

actor Snk () A ==> X:

action A: [a] ==> endend

F

Snk

Src

F

Snk

Src

F

Snk

actor F () A ==> X:

x := 0;

action A: [a] ==> X: [f(x)] do

x := g(a, x);end

end

Page 50: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

50

8x8 2D IDCT

Profiling: traces of a decoder

parsing, then IDCT

tail end of reconstruction, followed by IDCT

Page 51: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

51

Actor CopyToken X ==> Zcounter = 0;copy: action X[x] ==> Z[z]

docounter = counter + 1;z = x;

endend

Actor Generator ==> Ocounter = 0;gen: action ==> O[counter]

guard counter < 3do

counter = counter + 1;end

end

Actor Consumer X, Y ==>counter = 0;cons: action X[x], Y[y] ==>

docounter = counter + 1;

endend

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

Execution Trace Graph generation

Page 52: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

52

• Partitioning: all actors in one processing component• Scheduling: pre-defined static scheduling• Buffer dimensioning: infinite buffer sizes

Example 1)

Page 53: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

53

A

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

BCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Example 1)

Page 54: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

54

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:1

Example 1)

Page 55: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

55

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: B:copy:1

Example 1)

Page 56: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

56

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: C:copy:1

Example 1)

Page 57: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

57

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: D:copy:1

Example 1)

Page 58: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

58

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: E:copy:1

Example 1)

Page 59: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

59

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: F:copy:1

Example 1)

Page 60: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

60

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: G:cons:1

Example 1)

Page 61: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

61

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

At the end of the execution

Example 1)

Page 62: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

62

• Partitioning: each actor in a separate processing component• Scheduling: pre-defined static scheduling• Buffer dimensioning: infinite buffer sizes

Example 2)

Page 63: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

63

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Example 2)

Page 64: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

64

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:1

Example 2)

Page 65: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

65

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:2, B:copy:1, D:copy:1

Example 2)

Page 66: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

66

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:3, B:copy:2, C:copy:1D:copy:2, E:copy:1

Example 2)

Page 67: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

67

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: B:copy:3, C:copy:2, D:copy:3E:copy:2, F:copy:1

Example 2)

Page 68: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

68

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: C:copy:3, E:copy:3, F:copy:3G:cons:1

Example 2)

Page 69: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

69

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: F:copy:3, G:cons:2

Example 2)

Page 70: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

70

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: G:cons:3

Example 2)

Page 71: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

71

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

At the end of the execution

Example 2)

Page 72: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

72

• Partitioning: each actor in a separate processing component• Scheduling: pre-defined static scheduling• Buffer dimensioning: minimal buffer sizes (1)

Example 3)

Page 73: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

73

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Example 3)

Page 74: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

74

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:1

Example 3)

Page 75: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

75

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: B:copy:1, D:copy:1

FIFO FULL!

Example 3)

Page 76: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

76

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: B:copy:1, D:copy:1

Example 3)

Page 77: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

77

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:2, C:copy:1, E:copy:1

Example 3)

Page 78: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

78

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

FIFO FULL!

Example 3)

Page 79: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

79

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: B:copy:2, D:copy:2, F:copy:1

Example 3)

Page 80: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

80

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: B:copy:2, D:copy:2, F:copy:1

FIFO FULL!

Example 3)

Page 81: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

81

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:3, E:copy:2, G:cons:1

Example 3)

Page 82: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

82

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: A:gen:3, E:copy:2, G:cons:1FIFO

FULL!

Example 3)

Page 83: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

83

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: C:copy:2, D:copy:3, F:copy:2

Example 3)

Page 84: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

84

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: B:copy:3, E:copy:3, G:cons:2

Example 3)

Page 85: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

85

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: C:copy:3, F:copy:3

Example 3)

Page 86: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

86

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

Firing Step: G:cons:3

Example 3)

Page 87: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

87

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

ABCDEFG

Execution Gantt ChartExecution Trace Graph

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

At the end of the execution

Example 3)

Page 88: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

88

Example 2 (infinite buffer size)

ABCDEFG

ABCDEFGExample 3 (minimal buffer size)

Firings A

Firings B Firings C

Firings D Firings E Firings F

Firings G

AGenerator

BCopyToken

CCopyToken

DCopyToken

ECopyToken

FCopyToken

GConsumer

The same resulting trace, different execution times

Page 89: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

89

Execution trace graph

dataflow network: MPEG4-SP decoder17 actors, 38 buffers

ETG for a few frames of input stimulus176 649 nodes, 1 609 543 arcs

• Structural model of a dynamic execution• Abstract from time• Model for Design Space Exploration• Generated for a specific input stimulus• Carries the information about the potential parallelism

Page 90: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

90

Performance EstimationExecution Trace Post-Processing

• For each fired action (i.e. node of the execution trace) its Computational Load is measured• The dependencies set is extended according to the partitioning/scheduling configuration

Weighted Execution Trace

time

Vi

Vi Computational Load

write/read delay

action selection execution

The Execution Trace Performance Estimation

Page 91: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

91

Execution Trace Critical Path

The Critical Path Definition

Critical Path Algorithm O(n)Determine the design bottlenecks for improving the final throughput• Most Critical Actors and Actions• Most Critical Buffers (only for bounded buffer configuration analysis)

Definition – Critical path:the longest time-weighted sequence of events from the start of the program to its termination.

Page 92: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

92

Critical Path Metrics

The Critical Path Metrics and Analysies

Critical Path Analysis Techniques:• Critical Actions/Actors Ranking• Impact Analysis• Critical Path Gradient (i.e. refactoring directions)

NOTE: these metrics are used to estimate the possible design improvements obtained reducing the algorithmic complexity of an action.

timeVi

Vi Computational Load

write delay

action selection execution

Page 93: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

93

The Co-Exploration Framework

TURNUS Functionality Overview

DATAFLOW DESIGNEXPLORATIONCAL Profiler

Causation Trace Analysis and Post-Processing (Heuristics)

Computational Load

Critical PathMeasurement

Buffer SizeMin. & Optim.

MCD Partitioning

SW and HDL Code Synthesis (e.g. C/C++, C STHorm, Xronos HDL)

CAL Dataflow Program Architecture Model

Design Mapping:Partitioning, Scheduling Buffer Dimensioning

Cycle-Accurate Profiling Information

SW-HW Implementation

Causation Trace

Page 94: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

94

Experimental Results: Design Case

CP Analyses Scenario

Application:MPEG4-SP decoder, RVC-CAL

Platforms:STMicroelectronics STHormVirtex-5 FPGA

Input Sequences:4 different 10-frame QCIF bit streams (Akyio, Foreman, Suzie, News)

Execution Trace:Steps: ≈ 106

Dependencies: ≈ 107

Page 95: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

95

Critical Path Metrics

CP Analyses Actors and Actions Ranking

Critical Actions Ranking: • Actions (and actors) are ranked by their Critical Path Contribution (CPC)

Page 96: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

96

Critical Path Metrics

CP Analyses Impact Analysis

Impact Analysis: Which action need attention

Most Critical Action

different input stimulus

Page 97: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

97

Critical Path Metrics

CP Analyses Impact Analysis

Impact Analysis: Which action need attention

the maximum CL reduction that reduce the CP lengthoptimizing only this action

Page 98: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

98

Buffer Dimensioning Problem: Formulation

Buffer Dimensioning Formulation

Objective: find a trade-off between the buffer size and the critical path length Minimum buffer size configuration for a given throughput

Buffer Size

Criti

cal P

ath

Leng

th

DeadlockRegion

min. Buffers Size

CP length with unbounded buffer size

CP length with min. buffer size

Page 99: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

99

Heuristic Approach

Buffer Dimensioning Heuristics

Divide the problem onto two sub-problems:1. Compute the minimal buffer size2. Optimize only critical buffers (i.e. the buffers that blocks actions in the current CP)

Buffer Size

Criti

cal P

ath

Leng

th

DeadlockRegion

min. Buffers Size

CP length with unbounded buffer size

CP length with min. buffer size

Page 100: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

100

Critical Path Analysis

Buffer Dimensioning

Highlight the Critical Buffer:the one that introduces the highest total blocking latency time along the Critical Path

Highlight Critical Buffers along the Critical Path

time

Vi

Vi Computational Load

write delay

action selection execution

CP Analysis

Page 101: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

101

TURNUS Estimated Results

reasonable buffer size configuration

the CP length is not significantly reduced

Buffer Dimensioning Estimated Results

Page 102: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

102

Estimated Results v.s. Implementation Results

Buffer Dimensioning Estimated vs Implementation

Page 103: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

103

Design case: MPEG-4 AVC/H.264 decoder

• Initial performance summary

Design space exploration

MPEG-4 video decoders

Page 104: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

104

MPEG-4 AVC/H.264 decoder• Decoder_Y design points summary

Design space exploration

MPEG-4 video decoders

Page 105: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

105

MPEG-4 AVC/H.264 decoder

Design space exploration

MPEG-4 video decoders

Page 106: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

106

Case study-2: MPEG-4 AVC/H.264 decoder (cont.)

• Decoder_Y Pareto analysis

• Throughput optimal design points for resource (slice LUT) and power (frequency)

Design space exploration

MPEG-4 video decoders

Page 107: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

107

Case study-2: MPEG-4 AVC/H.264 decoder (cont.)

•Design points comparison (original vs best)• ~18% code change for ~13x throughput gain• Less than ~5% code rewriting, essentially “cut and paste”

Design space exploration

MPEG-4 video decoders

Page 108: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

108

Case study-2: MPEG-4 AVC/H.264 decoder (cont.)

• Comparison with related works

• Compares favorably with similar works• Rapidly obtain design alternatives in exploration space

Design space exploration

MPEG-4 video decoders

[58] FastVdo http://fastvdo.com/FV264. FV264-H.264/AVC ASIC IP CORE.[59] CoreEL Technologies http://www.coreel.com/pages/productsDigitalVideoH264CBPDecode.aspx, H.264 CBP Decoder.[114] M. Thadani et al. “ESL flow for a hardware H.264/AVC decoder using TLM-2.0 and high level synthesis: a quantitativestudy.” In Proceedings of SPIE - The International Society for Optical Engineering, 2009.[119] S. . Wang, et al. “A software-hardware co-implementation of MPEG-4 Advanced Video Coding (AVC) decoder withblock level pipelining.” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 2005..

Page 109: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

109

Multi Clock Domain Partitioning

Introduction MCD Partitioning

Objective: partitioning the dataflow applications onto a MCD architectures• Reduce the overall power consumption• Without impacting the (nominal) performances

ck3

ck2

ck2

ck1

ck1

ck1 ck3

Partitioning Example

Page 110: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

110

Multi Clock Domain Partitioning

MCD Partitioning Problem Formulation

Objective: partitioning the dataflow applications onto a MCD architectures• Reduce the overall power consumption• Without impacting the (nominal) performances

Find a mapping configuration Actor Clock Domain (i.e. clock frequency)

such that: we simplify the relation power - frequency with a linear function

we define the performanceconstraint in terms ofCritical Path Length

Page 111: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

111

Multi Clock Domain Partitioning

MCD Partitioning Analytical

Idea: slow down only actors outside the critical path

• The critical path length defines the longest serial part of the program execution• The execution time of an action is directly related to its clock frequency

Stretch the graph in order to have all the parallel path with the same length

Page 112: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

112

Multi Clock Domain Partitioning

MCD Partitioning Why Dataflow?

Different kinds of Arcs:

• Action Firings Dependencies

• Actions Firing inside the critical path

• Actions Firing outside the critical pathThe slow down factor is a free variable

(i.e. extend the arc’s length)

Outside the CP

Along the CP

Page 113: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

113

Multi Clock Domain Partitioning

MCD Partitioning Analytical

Analytical Problem Formulation: (LP-formulation)

Mapping Function (Actor Clock Domain):

Critical Path

Outside the Critical Path

Page 114: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

114

Multi Clock Domain Partitioning

MCD Partitioning Heuristic

Heuristic Approach: (sub-optimal!)Try to slow down iteratively the clocks for actors outside the critical path

If slowing down a clock the clock of an actor outside the critical path does

not change the performance, this isits new (temporarily) clock domain

Page 115: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

115

Experimental Results: MPEG4-SP decoder

Design Case MPEG4-SP CAL Design

Application:MPEG4-SP decoder41 RVC-CAL actors

Platforms:Virtex-5 FPGA

Execution Trace:Steps: ≈ 106

Dependencies: ≈ 107

Page 116: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

116

Experimental Results: MPEG4-SP decoder

Design Case Results

Power Reduction, compared to nominal performances(i.e. all actors mapped in the clock domain with highest frequency)

Page 117: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

117

• Buffer dimensioning

?→

?

?

????

????

?

??

??

?

?

?

Dataflow programs: design exploration problem

• Partitioning

• Scheduling

Page 118: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

Conclusions

Page 119: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

119

Conclusions

Conclusion

• Dataflow (stream) programming provides the opportunity of a unified high levelsystem design for both SW and HW implementations on heterogeneousplatforms

• Exploration of the VERY LARGE design space (dataflow stream architecture,HW/SW partitioning, buffer dim, SW core partitioning, scheduling, clockassignment, …) and optimization of parameters according to design criteria is aMUST!!

• Very interesting results and design optimizations can be obtained for dynamicdataflow programs by representing and exploring the execution trace

• Still progresses in tools support and further optimizations are underinvestigation: language extensions, design using mixed MoC, automaticsupport for dataflow refactoring, advanced scheduling, joint optimizationsheuristics ……..

Page 120: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

120

Bibliography (1)Simone Casale-Brunet, Małgorzata Michalska, Endri Bezati, Junaid Jameel Ahmad, Marco Mattavelli: “Profiling and high-precision performance estimation of dynamic dataflow programs on multi-core platforms” in “Modeling, Optimization and Simulation for High Performance and Ultra-Low Power Circuits and Systems”, Integration, the VLSI Journal.[To appear in 2018].

Simone Casale-Brunet and M. Mattavelli: “Execution trace graph of dataflow process networks”, IEEE Transactions on Multi-Scale Computing Systems, Volume: PP, Issue: 99, January 2018, Print ISSN: 2332-7766, DOI: 10.1109/TMSCS.2018.2790921

M. Michalska, S. Casale-Brunet , E. Bezati, and M. Mattavelli: “High-Precision Performance Estimation for the Design Space Exploration of Dynamic Dataflow Programs”, IEEE Transactions on Multi-Scale Computing Systems, Volume: PP, Issue: 99, November 2017, Print ISSN: 2332-7766, Digital Object Identifier: 10.1109/TMSCS.2017.2774294.

Casale-Brunet, S; Mattavelli, M.; Janneck, J.W., "Buffer optimization based on critical path analysis of a dataflow program design," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, vol., no., pp.1384,1387, Beijing, China, 19-23 May 2013, doi: 10.1109/ISCAS.2013.6572113

Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "TURNUS: A design exploration framework for dataflow system design," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, vol., no., pp.654,654, Beijing, China, 19-23 May 2013 doi: 10.1109/ISCAS.2013.6571927

- Małgorzata Michalska, Nicolas Zufferey and Marco Mattavelli: "Performance Estimation Based Multicriteria Partitioning Approach for Dynamic Dataflow Programs," Journal of Electrical and Computer Engineering, vol. 2016, Article ID 8536432, 15 pages, 2016. doi:10.1155/2016/8536432, http://www.hindawi.com/journals/jece/2016/8536432/ .

- M. Canale, S. Casale-Brunet, E. Bezati, M. Mattavelli, J. Janneck: "Dataflow Programs Analysis and Optimization Using Model Predictive Control Techniques", Journal of Signal Processing Systems, 2016, Vol: 84, No. 3, Pages 371--381, issn="1939-8115", Doi: 10.1007/s11265-015-1083-4, http://dx.doi.org/10.1007/s11265-015-1083-4 http://link.springer.com/article/10.1007/s11265-015-1083-4

- E. Bezati; S. Casale Brunet; M. Mattavelli; J. W. Janneck, "Clock-gating of streaming applications for energy efficient implementations on FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , vol. PP, Issue no.99, pp.1-1, Aug. 2016, doi: 10.1109/TCAD.2016.2597215.

- K. Jerbi, D. Renzi, D. De Saint Jorre, H. Yviquel, M. Raulet, C. Alberti, M. Mattavelli: “Development and optimization of high level dataflow programs: the HEVC decoder design case”, Journal of Signal Processing Systems, 2016.http://link.springer.com/article/10.1007/s11265-016-1113-x

Page 121: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

121

Bibliography (2)Anatoly Prihozhy, Endri Bezati, Ab Al-Hadi Ab Rahman, Marco Mattavelli: "Synthesis and Optimization of Pipelines for HW Implementations of Dataflow Programs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, October 2015, Pag. 1613-1626.

Jani Boutellier, Johan Ersfollk, Johan Lilius, Marco Mattavelli, Ghislain Roquier, Olli Silven: "Actor Merging for Dataflow Process Networks" IEEE Transactions on Signal Processing, Vol. 63, No. 10, May 15, 2015, Pag. 2496-2508.

C. Sau, P. Meloni, L. Raffo, F. Palumbo, E. Bezati, S. Casale Brunet and M. Mattavelli, “Automated Design Flow for Multi-Functional Dataflow-Based Platforms,” in Journal of Signal Processing Systems -Signal Image and Video Technology-, p. 1-23, 2015.http://link.springer.com/article/10.1007/s11265-015-1026-0

C. Massimo, S. Casale Brunet, E. Bezati, M. Mattavelli and J. Janneck, “Dataflow Programs Analysis and Optimization Using Model Predictive Control Techniques. Two Examples of Bounded Buffer Scheduling: Deadlock Avoidance and Deadlock Recovery Strategies,” in Journal of Signal Processing Systems, p. 1-11, 2015. Doi: 10.1007/s11265-015-1083-4.http://link.springer.com/article/10.1007%2Fs11265-015-1083-4.

M. Michalska, E. Bezati, S. Casale-Brunet and M. Mattavelli, "Buffer dimensioning for throughput improvement of dynamic dataflow signal processing applications on multi-core platforms," 2017 25th European Signal Processing Conference (EUSIPCO), Kos, 2017, pp. 1339-1343, DOI: 10.23919/EUSIPCO.2017.8081426

S. Casale-Brunet, E. Bezati and M. Mattavelli, "Design space exploration of dataflow-based Smith-Waterman FPGA implementations," 2017 IEEE International Workshop on Signal Processing Systems (SiPS), Lorient, France, 2017, pp. 1-6, DOI: 10.1109/SiPS.2017.8109982

S. Casale-Brunet, E. Bezati, A. Bloch and M. Mattavelli, "High-level system synthesis and optimization of dataflow programs for MPSoCs”, "2017 51th Asilomar Conference on Signals, Systems and Computers”, Pacific Grove, CA, 2017.

E. Bezati, S. Casale-Brunet and M. Mattavelli, "Execution Trace Graph based interface synthesis of signal processing dataflow programs for heterogeneous MPSoCs”, "2017 51th Asilomar Conference on Signals, Systems and Computers", Pacific Grove, CA, 2017

E. Bezati, S. Casale Brunet, M. Mattavelli, J. W. Janneck: "High-level system synthesis and optimization of dataflow programs for MPSoCs", in 2016 50th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA, November 6-9, 2016.

E. Bezati; S. Casale Brunet; M. Mattavelli; J. W. Janneck, "High-Level Synthesis of Dynamic Dataflow Programs on heterogeneous MPSoC platforms", in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS).

Page 122: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

122

Bibliography (3)Simone Casale-Brunet, Endri Bezati, Marco Mattavelli: “Programming Models and Methods for Heterogeneous Parallel Embedded Systems”, 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Lyon, France, September 21-23 2016.

Simone Casale-Brunet, Endri Bezati, Marco Mattavelli: “High Level Synthesis of Smith-Waterman dataflow implementations”, IEEE 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016

M. Michalska, S. Casale Brunet, E. Bezati, M. Mattavelli, J. Janneck. Trace-Based Manycore Partitioning of Stream-Processing Applications. 50th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA, November 6-9, 2016.

M. Michalska, S. Casale Brunet, E. Bezati, M. Mattavelli: “High-precision performance estimation of dynamic dataflow programs”. IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, IEEE MCSoC 2016, Lyon, France, September 21-23 2016.

M. Michalska, N. Zufferey, E. Bezati, M. Mattavelli: “Design space exploration problem formulation for dataflow programs on heterogeneous architectures”. IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, IEEE MCSoC2016, Lyon, France, September 21-23 2016.

C. Alberti, N. Daniels, M. Hernaez, J. Voges, R. L. Goldfeder, A. A. Hernandez-Lopez, M. Mattavelli, B. Berger: “An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values”, IEEE 2016 Data Compression Conference, Snowbird, Utah, USA, March 27-29, 2016, DOI 10.1109/DCC.2016.39.

M. Michalska, J.J. Ahmad, E. Bezati, S. Casale-Brunet, M. Mattavelli: “Performance Estimation of Program Partitions on Multi-core Platforms”. International Workshop on Power and Timing Modeling, Optimization and Simulation, Bremen, Germany, September 21-23 2016.

M. Michalska, N. Zufferey, M. Mattavelli: “Tabu Search for Partitioning Dynamic Dataflow Programs”. International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA, Procedia Computer Science, Volume 80, Pages 1577-1588, 2016.

M. Michalska, E. Bezati, S. Casale-Brunet, M. Mattavelli: “A Partition Scheduler Model for Dynamic Dataflow Programs”. International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA, Procedia Computer Science, Volume 80, Pages 2287–2291, 2016.

Page 123: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

123

Bibliography (4)M. Michalska, N. Zufferey, J. Boutellier, E. Bezati and M. Mattavelli. Efficient scheduling policies for dataflow programs executed on multi-core. Proceedings of the 9th International Workshop on Programmability and Architectures for Heterogeneous Multicore (MULTIPROG), Prague, Czech Republic, January 18, 2016.

M. Michalska, J. Boutellier and M. Mattavelli: “A methodology for Profiling and Partitioning Stream Programs on Many-core Architectures: ”International Conference on Computational Science (ICCS), Reykjavik, Iceland, June 1-3, 2015, Procedia Computer Science, Vol 51, pag. 2962-2966. Doi: 10.1016/j.procs.2015.05.498.http://www.sciencedirect.com/science/article/pii/S187705091501306X

M. Michalska, S. Casale-Brunet, E. Bezati and M. Mattavelli: “Execution Trace Graph Based Multi-criteria Partitioning of Stream Programs: ”International Conference on Computational Science (ICCS), Reykjavik, Iceland, June 1-3, 2015, Procedia Computer Science, Vol 51, pag. 1443-1452. Doi: 10.1016/j.procs.2015.05.334http://www.sciencedirect.com/science/article/pii/S1877050915011424

- Canale, M.; Casale-Brunet, S.; Bezati, E.; Mattavelli, M.; Janneck, J.W., “Dataflow Programs Analysis and Optimization using Model Predictive Control Techniques: an example of bounded buffer scheduling”, SiPS, 2014, Belfast, North Ireland.http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6986054DOI: 10.1109/SiPS.2014.6986054

- Casale-Brunet, S.; Bezati, E.; Mattavelli, M.; Canale, M.; Janneck, J.W., “Execution trace graph analysis of dataflow programs: bounded buffer scheduling and deadlock recovery using model predictive control”, Conference on Design and Architectures for Signal and Image Processing (DASIP 2014), October 2014, Madrid, Spain. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7115623DOI: 10.1109/DASIP.2014.7115623

- S. Casale Brunet, M. Michalska, E. Bezati, M. Mattavelli, J. Janneck and M. Canale: “TURNUS: an open-source design space exploration framework for dynamic stream programs,”, Conference on Design and Architectures for Signal and Image Processing (DASIP), October2014, Madrid, Spain.http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7115614DOI: 10.1109/DASIP.2014.7115614

- ICIP 2014 Damien de Saint Jorre, Claudio Alberti, Marco Mattavelli, Simone Casale-Brunet "Exploring MPEG HEVC decoder parallelism for the efficient porting onto many-core platforms" 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27-30 October 2014, p. 2115-2119, doi:10.1109/ICIP.2014.7025424.

Page 124: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

124

Bibliography (5)E. Bezati, S. Casale-Brunet, M. Mattavelli, J.W. Janneck: " Coarse Grain Clock Gating of Streaming Applications In Programmable Logic Implementations," Proceedings Of The 2014 Electronic System Level Synthesis Conference (Eslsyn), 2014, San Francisco, CA, USA, Doi:10.1109/ESLsyn.2014.6850387, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6850387.

C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, M. Mattavelli: “Automated Design Flow for Coarse-Grained Reconfigurable Platforms: an RVC-CAL Multi-Standard Decoder Use-Case”, IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014, Agios Konstantinos, Greece,DOI: 10.1109/SAMOS.2014.6893195http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6893195.

J.W. Janneck., S. Casale-Brunet, M. Mattavelli: “Characterizing Communication Behaviour of Dataflow Programs Using Trace Analysis”, IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014, AgiosKonstantinos, Greece, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6893193DOI: 10.1109/SAMOS.2014.6893193.

Rahman, A.A.A, Casale-Brunet S., Alberti C., Mattavelli M., “A Methodology for Optimizing Buffer Sizes of Dynamic Dataflow FPGAs Implementations”, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2014), Firenze, Italy,http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6854554DOI: 10.1109/ICASSP.2014.6854554

Ersfolk, J.; Roquier, G.; Lund, W.; Mattavelli, M. & Lilius, J., “Static and Quasi-static Compositions of Stream Processing Applications from Dynamic Dataflow Programs”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013.

Bezati, E.; Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "Synthesis and optimization of high-level stream programs," Electronic System Level Synthesis Conference (ESLsyn), 2013 , pp.1-6, Austin, Texas, USA, May 31-June 1, 2013.

Bezati, E., Thavot, R., Roquier, G., & Mattavelli, M. High-level dataflow design of signal processing systems for reconfigurable and multicore heterogeneous platforms. Journal of Real-Time Image Processing, 2013, doi:10.1007/s11554-013-0326-5.

Bezati, E.; Roquier, G.; Mattavelli, M., "Live demonstration: High level software and hardware synthesis of dataflow programs," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on , vol., no., pp.660,660, Beijing, China, 19-23 May 2013, doi: 10.1109/ISCAS.2013.6571930.

Page 125: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

125

Bibliography (6)Casale-Brunet, S.; Mattavelli, M.; Alberti, C.; Janneck, J.W., "Representing Guard Dependencies in Dataflow Execution Traces," Computational Intelligence, Communication Systems and Networks (CICSyN), 2013 Fifth International Conference on , vol., no., pp.291,295, Madrid, Spain, 5-7 June 2013, doi: 10.1109/CICSYN.2013.26,

Casale-Brunet, S.; Mattavelli, M.; Alberti, C.; Janneck, J.W., “Design Space Exploration of High Level Stream Programs on Parallel Architectures: A focus on the Buffer Size Minimization and Optimization Problem,” 8th International Symposium on Image and Signal Processing and Analysis (ISPA 2013), Trieste, Italy, 4-6 September 2013

Casale-Brunet, S.; Alberti, C.; Mattavelli, M.; Janneck, J.W., “TURNUS: a unified dataflow design space exploration framework for heterogeneous parallel systems,“ Conference on Design & Architectures for Signal & Image Processing (DASIP), Cagliari, Italy, 8-10 October 2013

Casale-Brunet, S.; Bezati, E.; Roquier, G., Alberti, C.; Mattavelli, M.; Janneck, J.W.; Boutellier, J., “Design Space Exploration and Implementation of RVC-CAL Applications using the TURNUS framework,“ Conference on Design & Architectures for Signal & Image Processing (DASIP), Cagliari, Italy, 8-10 October 2013

Casale-Brunet, S.; Bezati, E.; Alberti, C.; Mattavelli, M.; Amaldi, E.; Janneck, J.W., " Partitioning and Optimization of high level Stream applications for Multi Clock Domain Architectures," Signal Processing Systems (SiPS), 2013 IEEE Workshop on, Taipei, Taiwan, 16-18 October 2013

Casale-Brunet, S.; Bezati, E.; Alberti, C.; Mattavelli, M.; Amaldi, E.; Janneck, J.W., “Multi-clock domain optimization for reconfigurable architectures in high-level dataflow applications,” Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 3-6 November 2013

Casale-Brunet, S.; Alberti, C.; Mattavelli, M.; Janneck, J.W., “Systems Design Space Exploration by Serial Dataflow Program Executions,” Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 3-6 November 2013

Casale-Brunet, S.; Elguindy, A.; Bezati, E.; Thavot, R.; Roquier, G., Mattavelli, M.; Janneck, J.W., “Methods to explore design space for MPEG RMC codec specifications,” Signal Processing: Image Communication, 2013, ISSN 0923-5965, DOI: 10.1016/j.image.2013.08.012.

Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "Profiling of Dataflow Programs Using Post Mortem Causation Traces," Signal Processing Systems (SiPS), 2012 IEEE Workshop on , vol., no., pp.220,225, Québec City, Canada, 17-19 October 2012, doi: 10.1109/SiPS.2012.54,

Page 126: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

126

Bibliography (7)Bezati, E.; Casale-Brunet, S.; Mattavelli, M.; Janneck, J.W., "Synthesis and optimization of high-level stream programs," Electronic System Level Synthesis Conference (ESLsyn), 2013 , vol., no., pp.1,6, Austin, Texas, USA, May 31-June 1 2013

Bezati, E., Thavot, R., Roquier, G., & Mattavelli, M. High-level dataflow design of signal processing systems for reconfigurable and multicoreheterogeneous platforms. Journal of Real-Time Image Processing, 2013, doi:10.1007/s11554-013-0326-5

Bezati, E.; Roquier, G.; Mattavelli, M., "Live demonstration: High level software and hardware synthesis of dataflow programs," Circuits and Systems (ISCAS), 2013 IEEE International Symposium on , vol., no., pp.660,660, Beijing, China, 19-23 May 2013, doi: 10.1109/ISCAS.2013.6571930

Bezati E.; Mattavelli M.; Janneck, J.W, “High-Level Synthesis of Dataflow Programs for Signal Processing Systems”, 8th International Symposium on Image and Signal Processing and Analysis (ISPA 2013), Trieste, Italy, 4-6, September 2013

Rahman, A.A.A; Casale-Brunet, S.; Alberti, C.; Mattavelli, M., “Dataflow Program Analysis and Refactoring Techniques for Design Space Exploration: MPEG-4 AVC/H.264 Decoder Implementation Case Study,“ Conference on Design & Architectures for Signal & Image Processing (DASIP), Cagliari, Italy, 8-10 October 2013

Rahman, A.A.A.; Thavot, R.; Casale-Brunet, S.; Bezati, E.; Mattavelli, M., "Design space exploration strategies for FPGA implementation of signal processing systems using CAL dataflow program," Design and Architectures for Signal and Image Processing (DASIP), 2012 Conference on, vol., no., pp.1,8, Karlsruhe, Germany, 23-25 October 2012

Ahmad, J.J.; Li, S.; & Mattavelli, M., "Performance Benchmarking of RVC based Multimedia Specifications,” Proceedings of the 20th IEEE International Conference on Image Processing (ICIP), Melbourne Australia, September 2013 © IEEE.

Ahmad, J.J.; Li, S.; Thavot, R.; & Mattavelli, M., "Secure Computing with the MPEG RVC Framework," Signal Processing: Image Communication, August 2013 © Elsevier Science B. V., DOI: 10.1016/j.image.2013.08.015

Ersfolk, J.; Roquier, G.; Lund, W.; Mattavelli, M. & Lilius, J., “Static and Quasi-static Compositions of Stream Processing Applications from Dynamic Dataflow Programs”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013.

M. Mattavelli: "MPEG Reconfigurable Video Representation", in "The MPEG Representation of Digital Media", Chapter 12, Springer Ed. 2011, DOI 10.1007/978-1-4419-6184-6_12. http://www.springer.com/engineering/signals/book/978-1-4419-6183-9http://www.springerlink.com/content/q7728730l4215455/

M. Mattavelli, M. Raulet, J. Janneck: "MPEG Reconfigurable Video Coding" in "Handbook of Signal processing Systems", Pag. 43-68, S. S. Bhattacharyya, Ed F. Deprettere, R. Leupers and J. Takala Editors, Springer 2010. http://www.springer.com/engineering/signals/book/978-1-4419-6344-4

Page 127: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

127

Bibliography (8)Ab Al-Hadi Ab Rahman, Anatoly Prihozhy, Marco Mattavelli: "Pipeline Synthesis and Optimization of FPGA-based Video Processing Applications with CAL", Eurasip Journal on Image and Video Processing, 2011, 2011:19, http://jivp.eurasipjournals.com/content/2011/1/19.

Marco. Mattavelli, Ihab Amer and Mickael Raulet: "The reconfigurable video coding standard", IEEE Signal Processing Magazine, May 2010, Vol27, No. 3, pag. 159-167.

Shuvra S. Bhattacharyya, Gordon Brebner, Johan Eker, Jorn W. Janneck, Marco Mattavelli, Carl von Platen, and Mickael Raulet: “OpenDF - A Dataflow Toolset for Reconfigurable Hardware and Multicore Systems”, ACM SIGARCH Computer Architecture News, Special Issue: MCC08 –Multicore Computing 2008, Volume 36, Number 5, pp 29-35, December 2008.

Ihab Amer, Christophe Lucarz , Marco Mattavelli, Ghislain Roquier, Mickael Raulet, Olivier Deforges, Jean-Francois Nezan: "Reconfigurable Video Coding: the Video Coding Standard for Multi-core Platforms", IEEE Signal Processing Magazine, Special issue on Multicore Platforms, Vol 26, No. 6, November 2009, pag 113-123.

Mickael Raulet, Marco Mattavelli, Jorn Janneck: "Guest editorial: Special Issue on Reconfigurable Video Coding", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0418-4, Link - http://www.springerlink.com/content/k0157333g2602l55/.

Christophe Lucarz, Jonathan Piat, Marco Mattavelli: "Automatic synthesis of parsers and validation of bitstreams within the MPEG Reconfigurable Video Coding Framework", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0395-7 Link -http://www.springerlink.com/content/gg261p657x483m91.

Hussein K Aman-Allah, Karim M Maarouf, Ehab A Hanna, Ihab M Amer, Marco Mattavelli: "CAL Dataflow Components For an MPEG RVC AVC Baseline Encoder", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0396-6 Link -http://www.springerlink.com/content/j487766m06306127.

Shuvra S. Bhattacharyya, Johan Eker, Jorn W. Janneck, Christophe Lucarz, Marco Mattavelli, Mickael Raulet: "Overview of the MPEG Reconfigurable Video Coding Framework", Journal of Signal Processing Systems, 2009, DOI - 10.1007/s11265-009-0399-3 Link -http://www.springerlink.com/content/24t3610261j1628j.

Page 128: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

128

Credits and thanks to:

Conclusion

• Prof. Jorn Janneck, Prof. Edoardo Amaldi (Lund University, Politecnico diMilano)

• Simone Casale Bruner, Endri Bezati, Malgorzata Michalska, Claudio Alberti,Ghislain Roquier, Ab Al-Hadi Ab Rahman (EPFL)

• Dave Parlour, Ian Miller (Xilinx)

• Mickael Raulet, Mathieu Wipliez, JF Nezan (INSA)

• Johan Eker, Carl von Platen (Ericsson)

• And many others ………

Page 129: Prezentacja programu PowerPoint...modules, 3) makes minimal assumptions about the physical architecture of the computing machine it is implemented on, 4) is efficiently implementable

129

Thanks for your attention!