Pl Standard Toolkit Reference
8/22/2019 Pl Standard Toolkit Reference
IBM InfoSphere Streams Version 2.0.0.4
IBM Streams Processing Language Standard Toolkit Reference
Note: Before using this information and the product it supports, read the general information under Notices on page 63.
Edition Notice
This document contains proprietary information of IBM. It is provided under a license agreement and is protected by copyright law. The information contained in this publication does not include any product warranties, and any statements provided in this manual should not be interpreted as such.
You can order IBM publications online or through your local IBM representative.
• To order publications online, go to the IBM Publications Center at www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
• To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at www.ibm.com/planetwide
When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.
Copyright IBM Corporation 2011, 2012. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Summary of changes
This topic describes updates to this documentation for IBM InfoSphere Streams Version 2.0 (all releases).
Note: The following revision characters are used in the InfoSphere Streams documentation to indicate updates for Version 2.0.0.4:
• In PDF files, updates are indicated by a vertical bar (|) to the left of each new or changed line of text.
• In HTML files, updates are surrounded by double angle brackets (>> and <<).
• Multiple DirectoryScan operators can scan the same directory simultaneously if the processed files are moved to a different directory before generating the output tuple.
• The DirectoryScan operator supports custom output functions to provide additional information about the generated file.
• The interface parameter is added to the TCPSource, TCPSink, and UDPSource operators to specify the network interface to use when registering the address with the name parameter.
• The nConnections metric is added to the TCPSource and TCPSink operators to indicate the number of active TCP/IP connections.
• The append parameter is added to the FileSink operator to append the generated tuples to the output file. For more information, see FileSink on page 20.
• The ignoreOpenErrors parameter is added to the FileSource operator to read successive files if a file cannot be opened for reading. For more information, see FileSource on page 15.
• An optional output port is added to the FileSource operator to indicate the files that were processed and those that could not be opened successfully.
• If an SPL program or a toolkit uses the new features that are added to the Standard Toolkit in IBM InfoSphere Streams Version 2.0.0.3, you must set the Standard Toolkit version to 1.0.1 in the info.xml file. For more information about the info.xml file and how to set dependencies on other toolkits, see how to create toolkits in the IBM Streams Processing Language Toolkit Development Reference.
Updates for Version 2.0.0.2 (Version 2.0, Fix Pack 2)
• The DirectoryScan operator uses the change time (ctime) of the file to detect if the file has been recreated. For more information, see DirectoryScan on page 23.
• The hasHeaderLine parameter of the FileSource operator supports multiple lines of column names for csv format. For more information, see FileSource on page 15.
• A logic clause cannot be specified for the Export operator.
• A config clause cannot be specified for the Import and Export operators.
• If a file is moved to a directory that is on a different file system, a .rename subdirectory might be created in the target directory for the file move operation to be atomic. For more information, see FileSink on page 20 and FileSource on page 15.
Updates for Version 2.0.0.1 (Version 2.0, Fix Pack 1)
This guide was not updated for Version 2.0.0.1.
Abstract
This document describes the operators that are provided by the IBM Streams Processing Language (SPL) standard toolkit. This standard toolkit is specific to IBM InfoSphere Streams.
Contents
Summary of changes
Abstract
Chapter 1. Relational Operators
  Filter
  Functor
  Punctor
  Sort
  Join
  Aggregate
Chapter 2. Adapter Operators
  FileSource
  FileSink
  DirectoryScan
  TCPSource
  TCPSink
  UDPSource
  UDPSink
  Export
  Import
  MetricsSink
Chapter 3. Utility Operators
  Custom
  Beacon
  Throttle
  Delay
  Barrier
  Pair
  Split
  DeDuplicate
  Union
  ThreadedSplit
  DynamicFilter
  Gate
  JavaOp
Chapter 4. Compat Operators
  V1TCPSource
  V1TCPSink
  Compat.Sample
Notices
Chapter 1. Relational Operators
Filter
Description
The Filter operator removes tuples from a stream by passing along only those that satisfy a user-specified condition. Non-matching tuples may be sent to a second optional output.
Input Ports
The Filter operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.
Output Ports
The Filter operator is configurable with one or two output ports. The first output port is mandatory, non-mutating, and its punctuation mode is Preserving. The second output port is optional, non-mutating, and its punctuation mode is Preserving. The Filter operator requires that the stream type of the output port(s) match the stream type of the input port. The first output port will receive the tuples that match the filter expression. The second output port, if present, will receive the tuples that fail to match the filter expression.
Parameters
The Filter operator has the following parameters:
filter
This is an optional parameter, which specifies the condition that determines the tuples to be passed along by the Filter operator. It takes a single expression of type boolean as its value. When not specified, it is assumed to be true.
Windowing
The Filter operator does not accept any window configurations.
Assignments
The Filter operator does not allow assignments to output attributes. The output tuple attributes are automatically forwarded from the input ones.
    composite Main {
      graph
        stream Beat = Beacon() {}
        stream Youngs = Filter(Beat)
        {
          param filter : age < 30u;
        }
        (stream Younger; stream Older) = Filter(Beat)
        {
          param filter : age < 30u;
        }
    }
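The two-output behaviour described above can be modelled outside of Streams. The following Python sketch is only an illustration of the semantics, not SPL and not part of the toolkit; the function name filter_op and the dictionary-based tuples are assumptions made for the example.

```python
def filter_op(tuples, predicate=lambda t: True):
    """Model of Filter with both output ports.

    Returns (matching, non_matching). The default predicate mirrors the
    documented default of the filter parameter (true).
    """
    matching, non_matching = [], []
    for t in tuples:
        # Tuples are forwarded unchanged; Filter performs no assignments.
        (matching if predicate(t) else non_matching).append(t)
    return matching, non_matching


# Hypothetical input tuples, loosely following the age attribute above.
beat = [{"name": "ann", "age": 25}, {"name": "bob", "age": 41}]
younger, older = filter_op(beat, lambda t: t["age"] < 30)
```

With a single output port, the second list would simply be discarded.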
Functor
Description
The Functor operator is used to transform input tuples into output ones, and optionally filter them as in a Filter operator. Every input tuple that is not filtered out results in a tuple on each output port.
Input Ports
The Functor operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.
Output Ports
The Functor operator is configurable with one or more output ports. The output ports are mutating and their punctuation mode is Preserving.
Parameters
The Functor operator has the following parameters:
filter
This is an optional parameter, which specifies the condition that determines which input tuples are to be operated on by the Functor operator. It takes a single expression of type boolean as its value. When not specified, it is assumed to be true, that is, tuples are transformed, but no filtering is performed.
Windowing
The Functor operator does not accept any window configurations.
Assignments
The Functor operator allows assignments to output attributes. The output tuple attributes whose assignments are not specified are automatically forwarded from the input ones. After the automatic forwarding, the Functor operator expects all output tuple attributes to be completely assigned.
    composite Main {
      graph
        stream Beat = Beacon() {}
        stream Annotated = Functor(Beat)
        {
          param filter : age >= 18u;
          output Annotated : login = lower(name),
                             info = { young = (age1000000ul) };
        }
        (stream Age;
         stream Salary) = Functor(Beat)
        {
          param filter : age >= 18u;
        }
    }
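The assignment rule for Functor (explicit assignments first, then automatic forwarding of the remaining attributes) can be sketched in Python. This is a hedged model rather than SPL: the function name functor, the out_attrs list, and forwarding by attribute name are assumptions made for illustration.

```python
def functor(tuples, out_attrs, assignments, predicate=lambda t: True):
    """Model of a single-output Functor.

    out_attrs   : names of the output attributes (the output schema).
    assignments : dict mapping an output attribute to a function of the
                  input tuple (the explicit output assignments).
    """
    for t in tuples:
        if not predicate(t):
            continue  # filtered out: no output tuple is produced
        out = {}
        for attr in out_attrs:
            if attr in assignments:
                out[attr] = assignments[attr](t)
            else:
                # Automatic forwarding, assumed here to match by name.
                # A KeyError stands in for "not completely assigned".
                out[attr] = t[attr]
        yield out


people = [{"name": "Ann", "age": 20}, {"name": "Bo", "age": 10}]
annotated = list(
    functor(
        people,
        out_attrs=["name", "login"],
        assignments={"login": lambda t: t["name"].lower()},
        predicate=lambda t: t["age"] >= 18,
    )
)
```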
Punctor
Description
The Punctor operator is used to transform input tuples into output ones and add window punctuations to the output.
Input Ports
The Punctor operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.
Output Ports
The Punctor operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating.
Parameters
The Punctor operator has the following parameters:
punctuate
This is a mandatory parameter, which specifies the condition that determines when a window punctuation is to be generated. It takes a single expression of type boolean as its value.
position
This is a mandatory parameter, which specifies the position of the generated window punctuation with respect to the current tuple. The valid values are before and after. If the value is before, the punctuation will be generated before the output tuple, otherwise it will be generated after the output tuple.
Windowing
The Punctor operator does not accept any window configurations.
Assignments
The Punctor operator allows assignments to output attributes. The output tuple attributes whose assignments are not specified are automatically forwarded from the input ones. After the automatic forwarding, the Punctor operator expects all output tuple attributes to be completely assigned.
    composite Main {
      graph
        stream Beat = Beacon() {}
        stream Annotated = Punctor(Beat)
        {
          param punctuate : age >= 18u;
                position : after; // add a punctuation after the generated tuple,
                                  // if the age is >= 18
          output Annotated : login = lower(name),
                             info = { young = (age1000000ul) };
        }
    }
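The effect of the position parameter can be illustrated with a small Python model. This is a sketch of the described behaviour only; the name punctor and the "PUNCT" marker string are assumptions used to stand in for a window punctuation.

```python
PUNCT = "PUNCT"  # stand-in for a window punctuation marker


def punctor(tuples, punctuate, position="after"):
    """Model of Punctor: emit each tuple, plus a punctuation when the
    punctuate condition holds, placed before or after that tuple."""
    out = []
    for t in tuples:
        fire = punctuate(t)
        if fire and position == "before":
            out.append(PUNCT)
        out.append(t)
        if fire and position == "after":
            out.append(PUNCT)
    return out


ages = [1, 20, 3]
after = punctor(ages, lambda age: age >= 18, "after")
before = punctor(ages, lambda age: age >= 18, "before")
```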
Sort
Description
The Sort operator is used to order tuples based on user-specified ordering expressions and window configurations.
Input Ports
The Sort operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is WindowBound. The Sort operator will process window marker punctuations when configured with a punctuation based window.
Output Ports
The Sort operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating. The Sort operator will generate a punctuation after each batch of sorted tuples it outputs. The Sort operator requires that the stream type for the output port matches the stream type for the input port.
Parameters
The Sort operator has the following parameters:
sortBy
This is a mandatory parameter that specifies one or more expressions to be used for sorting the tuples. The sort is performed in lexicographical manner in ascending order. That is, the first expression will be used first for the comparison and in the case of equality the second expression will be considered, and so on. The default sort order of ascending implies that the output stream will produce tuples in non-decreasing order. The sort order can be changed using the order parameter.
order
This is an optional parameter that specifies either the global sort order, or the sort order for the individual expressions that appear in the sortBy parameter. The valid values are ascending and descending. When a single value is specified for the order parameter it determines the global sort order. When multiple values are specified, then the number of values must match the number of sortBy expressions.
partitionBy
This is an optional parameter that is only valid for a Sort operator configured with a partitioned window (see below). It specifies one or more expressions to be used for partitioning the input tuples into sub-windows, where all window and parameter configurations apply to the sub-windows, independently.
Windowing
The Sort operator supports the following window configurations:
tumbling, (count | delta | time | punctuation)-based eviction (, partitioned (, partitionEvictionSpec)? )?
sliding, count-based eviction, count-based trigger of 1 (, partitioned (, partitionEvictionSpec)? )?
For the tumbling variants, tuples are sorted when the window gets full and are output at once. A window marker punctuation is output at the end.
For the sliding variants, tuples are always kept in sorted order. Once the window gets full, every new tuple causes the first one in the sorted order to be removed from the window and output. This type of sort is referred to as progressive sort.
For the partitioned variants, the window specification applies to individual sub-windows identified by the partitionBy parameter.
For the tumbling variants, the final punctuation marker does not flush the window (so as not to break invariants on the output), whereas for the sliding variants (progressive), the final punctuation marker does flush the window.
Assignments
The Sort operator does not allow assignments to output attributes. The output tuple attributes are automatically forwarded from the input ones.
Metrics
The Sort operator has the following metrics:
• nCurrentPartitions: The number of partitions currently in the window for the Sort operator.
    composite Main {
      graph
        stream Beat = Beacon() {}
        // count based window
        stream Sorted0 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
        }
        // count based partitioned window
        stream Sorted1 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10), partitioned;
          param
            partitionBy : name;
            sortBy : (float64)salary/(float64)age;
        }
        // count based window, with sort order
        stream Sorted2 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
            order : descending;
        }
        // count based window, with sort order for each sortBy expression
        stream Sorted3 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
            order : ascending, descending;
        }
        // punctuation based window
        stream Sorted4 = Sort(Beat)
        {
          window
            Beat : tumbling, punct();
          param
            sortBy : name, (float64)salary/(float64)age;
        }
        // time based window
        stream Sorted5 = Sort(Beat)
        {
          window
            Beat : tumbling, time(10);
          param
            sortBy : name, (float64)salary/(float64)age;
        }
        // delta based window
        stream BeatId = Beacon() {}
        stream Sorted6 = Sort(BeatId)
        {
          window
            BeatId : tumbling, delta(id, 10u);
          param
            sortBy : (float64)salary/(float64)age;
        }
        // progressive sort
        stream Sorted = Sort(Beat)
        {
          window
            Beat : sliding, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
        }
    }
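The difference between the tumbling and sliding (progressive) variants of Sort can be seen in a short Python model. It is a sketch under stated assumptions, not the operator's implementation: the function names are hypothetical, "PUNCT" stands in for a window marker punctuation, and for the sliding case the new tuple is assumed to be inserted before the smallest tuple is emitted.

```python
import heapq


def tumbling_sort(tuples, size, key):
    """Tumbling, count-based window: sort and emit each full window,
    followed by a punctuation. A partial window left at the end is not
    flushed, mirroring the final-punctuation rule for tumbling windows."""
    out, window = [], []
    for t in tuples:
        window.append(t)
        if len(window) == size:
            out.extend(sorted(window, key=key))
            out.append("PUNCT")
            window.clear()
    return out


def progressive_sort(tuples, size, key):
    """Sliding, count-based window: keep tuples ordered in a heap and,
    once the window is full, emit the smallest tuple for every new one.
    The remaining tuples are flushed at the end (final punctuation)."""
    out, heap = [], []
    for seq, t in enumerate(tuples):
        # seq breaks ties so that tuples themselves are never compared.
        heapq.heappush(heap, (key(t), seq, t))
        if len(heap) > size:
            out.append(heapq.heappop(heap)[2])
    while heap:
        out.append(heapq.heappop(heap)[2])
    return out


batch = tumbling_sort([3, 1, 2, 5, 4], size=2, key=lambda x: x)
stream = progressive_sort([3, 1, 2, 0], size=2, key=lambda x: x)
```

The progressive result is only approximately ordered: a tuple that arrives after smaller-ranked tuples have already left the window cannot be placed ahead of them.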
Join
Description
The Join operator is used to correlate tuples from two streams based on user-specified match predicates and window configurations. When a tuple is received on an input port, it is inserted into the window corresponding to the input port, which causes the window to trigger. As part of the trigger processing, the tuple is compared against all tuples inside the window of the opposing input port. If the tuples match, then an output tuple will be produced for each match. If at least one output was generated, a window punctuation will be generated after all the outputs.
If equalityRHS and equalityLHS parameters are specified, the matching will be done using a hash table. Otherwise a scan of the tuples in the window will be done to find the matches.
In an outer join configuration, if a tuple does not get involved in a match during its stay in the join window, then it will be sent out to an output port right before its eviction from the window. See the algorithm parameter for details.
Partitioning may be used to split the tuples into partitioned windows.
Input Ports
The Join operator is configurable with two input ports. The input ports are non-mutating and their punctuation mode is Oblivious.
Output Ports
The Join operator is configurable with a single output port in the case of an inner join, one or two output ports in the case of a rightOuter or leftOuter join, and one or three output ports in the case of an outer join. The output ports are mutating. The punctuation mode is Generating for the first output port and Free for any other output ports that may exist. The Join operator will generate a punctuation after each batch of joined tuples it outputs on its first output port.
Parameters
The Join operator has the following parameters:
match
This optional parameter specifies an expression of type boolean to be used for matching the tuples. The expression could refer to attributes from both input ports. When omitted, the default value of true is used.
algorithm
This optional parameter is used to specify the join algorithm to be used. The valid options are leftOuter, rightOuter, outer, and inner. In a left outer join, a tuple that is being evicted from the left port's window and has never been involved in a match earlier is paired with a default initialized tuple (whose attributes are default constructed) from the right port and output. If a defaultTupleRHS parameter is specified, its value is used instead of the default constructed tuple. A right outer join is similar, but applies to tuples that are being evicted from the right port's window and employs the defaultTupleLHS parameter if present. An outer join is a combination of left and right outer joins. The default for this parameter is the inner join option, which does not perform any action upon eviction of tuples.
For leftOuter and rightOuter joins, an optional second output port can be specified. In this case, the evicted tuples that have no matches are output on the second output port and are not joined with an empty tuple from the opposite window. The schema of the second output port must match that of the left input port in the case of a leftOuter join and the right input port in the case of a rightOuter join. For an outer join, optional second and third output ports can be specified. This means that the outer join can have either one output port or three output ports. When specified, the second port is used to output evicted tuples from the left input port that have no matches and the third port is used to output the ones from the right input port. The schemas of the second and third output ports must match the schemas of the first and second input ports, respectively.
defaultTupleLHS
This optional parameter can be specified to indicate the tuple to be used from the left stream, for matching an expiring tuple from the right window that needs to be output as part of a right outer join or outer join algorithm. It is only valid for join operators with a single output port and those that have rightOuter or outer as the join algorithm. It can take a single value of tuple type, which must match the type of the tuples from the left stream.
defaultTupleRHS
This optional parameter can be specified to indicate the tuple to be used from the right stream, for matching an expiring tuple from the left window that needs to be output as part of a left outer join or outer join algorithm. It is only valid for join operators with a single output port and those that have leftOuter or outer as the join algorithm. It can take a single value of tuple type, which must match the type of the tuples from the right stream.
equalityLHS
This optional parameter is used to specify equality condition expressions from the left port. The number of expressions and their types must match those from the equalityRHS parameter. The expressions could refer to attributes from the left input port only.
equalityRHS
This optional parameter is used to specify equality condition expressions from the right port. The number of expressions and their types must match those from the equalityLHS parameter. The expressions could refer to attributes from the right input port only.
The equalityLHS and equalityRHS parameters can be used to specify equi-join match predicates, which results in using a hash-based join implementation, rather than a nested-loop one. They are not mutually exclusive with the match parameter and can be used together.
partitionByLHS
This optional parameter specifies one or more expressions to be used for partitioning the input tuples from the left port into sub-windows, where all window and parameter configurations apply to the sub-windows, independently. It can only be used if a partitioned window is defined for the left port (see below). The expressions could refer to attributes from the left input port only.
partitionByRHS
This optional parameter specifies one or more expressions to be used for partitioning the input tuples from the right port into sub-windows, where all window and parameter configurations apply to the sub-windows, independently. It can only be used if a partitioned window is defined for the right port (see below). The expressions could refer to attributes from the right input port only.
Windowing
The Join operator supports the following window configurations for a given input port:
sliding, (count | delta | time)-based eviction, count-based trigger of 1 (, partitioned (, partitionEvictionSpec)? )?
All window configurations have a count-based trigger of 1. This means that every time a tuple is received on a port, it is inserted into its window, which triggers the join processing. The newly inserted tuple is matched against the tuples resident in the window defined over the other input port. In case of matches, a result is output for each match and a window marker punctuation is output at the end.
For the partitioned variants, the window specification applies to individual sub-windows identified by the partitionBy parameter corresponding to the port. The left input port of the join cannot have a partitioned window defined unless a partitionByLHS parameter is specified. Similarly, the right input port of the join cannot have a partitioned window defined unless a partitionByRHS parameter is specified.
Assignments
The Join operator allows assignments to output attributes. The output tuple attributes whose assignments are not specified are automatically forwarded from the input ones. After the automatic forwarding, the Join operator expects all output tuple attributes to be completely assigned.
Metrics
The Join operator has the following metrics:
• nCurrentPartitionsLHS: The number of partitions currently in the left-hand side window for the Join operator.
• nCurrentPartitionsRHS: The number of partitions currently in the right-hand side window for the Join operator.
    composite Main {
      graph
        stream BeatL = Beacon() {}
        stream BeatR = Beacon() {}
        // join with a match condition
        stream Join1 = Join(BeatL; BeatR) {
          window
            BeatL : sliding, count(100);
            BeatR : sliding, time(10);
          param
            match : BeatR.name == BeatL.firstName + " " + BeatL.lastName &&
                    department == "HR";
          output
            Join1 : salary = salary * 2ul;
        }
        // equi-join with an additional match condition
        stream Join2 = Join(BeatL; BeatR) {
          window
            BeatL : sliding, count(100);
            BeatR : sliding, time(10);
          param
            match : department == "HR";
            equalityLHS : BeatL.firstName + " " + BeatL.lastName;
            equalityRHS : name;
          output
            Join2 : salary = salary * 2ul;
        }
        // equi-join with multiple equality expressions
        stream Join3 = Join(BeatL; BeatR) {
          window
            BeatL : sliding, count(100);
            BeatR : sliding, time(10);
          param
            equalityLHS : BeatL.firstName + " " + BeatL.lastName, "HR";
            equalityRHS : name, department;
          output
            Join3 : salary = salary * 2ul;
        }
        // single-sided partitioned join with a 0 sized window on the right hand side
        // and a partitioned window of 1 on the left hand side
        stream VWAP = Beacon() {}
        stream Quote = Beacon() {}
        stream Bargain = Join(VWAP; Quote)
        {
          window
            VWAP : sliding, count(1), partitioned;
            Quote : sliding, count(0);
          param
            match : vwap > askprice*100.0d;
            partitionByLHS : VWAP.ticker;
            equalityLHS : VWAP.ticker;
            equalityRHS : Quote.ticker;
          output
            Bargain : bargainIndex = exp(vwap-askprice*100.0d)*asksize;
        }
        // a left outer join with single output
        stream MsgLHS = Beacon() {}
        stream MsgRHS = Beacon() {}
        stream Msgs1 = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned;
          param
            algorithm : leftOuter;
            partitionByRHS : R.kind;
            defaultTupleRHS : { message = "N/A", kind = 0u, tm = 0ul};
            equalityLHS : L.message, L.kind;
            equalityRHS : R.message, R.kind;
          output
            Msgs1 : message1 = L.message, message2 = R.message;
        }
        // a right outer join with two outputs
        (stream Msgs2;
         stream MsgsRHS2)
          = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned;
          param
            algorithm : rightOuter;
            partitionByRHS : R.kind;
            equalityLHS : L.message;
            equalityRHS : R.message;
          output
            Msgs2 : message1 = L.message, message2 = R.message;
        }
        // an outer join with three outputs
        (stream Msgs3;
         stream MsgsLHS3;
         stream MsgsRHS3)
          = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned;
          param
            algorithm : outer;
            partitionByRHS : R.kind;
            equalityLHS : L.message;
            equalityRHS : R.message;
          output
            Msgs3 : message1 = L.message, message2 = R.message;
        }
        // an outer join with a single output.
        // Discard unreferenced partitions after 60 seconds.
        stream Msgs4 = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned, partitionAge(60.0);
          param
            algorithm : outer;
            partitionByRHS : R.kind;
            equalityLHS : L.message;
            equalityRHS : R.message;
          output
            Msgs4 : message1 = L.message, message2 = R.message;
        }
    }
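The trigger-on-insert behaviour and the hash-based matching used when equalityLHS and equalityRHS are given can be sketched in Python. This is a simplified model, not the toolkit's implementation: the class name HashJoin, the "L"/"R" side labels, and the "PUNCT" marker are assumptions, only inner joins over count-based windows of size 1 or more are modelled, and the oldest tuple is assumed to be evicted before the new one is inserted.

```python
from collections import defaultdict, deque


class HashJoin:
    """Inner equi-join over two sliding, count-based windows."""

    def __init__(self, size_left, size_right, key_left, key_right,
                 match=lambda left, right: True):
        self.size = {"L": size_left, "R": size_right}
        self.key = {"L": key_left, "R": key_right}
        self.window = {"L": deque(), "R": deque()}
        # Per-side hash index: equality key -> tuples currently in window.
        self.index = {"L": defaultdict(deque), "R": defaultdict(deque)}
        self.match = match

    def insert(self, side, t):
        """Insert t on side "L" or "R" and return the outputs it triggers."""
        window, index, key = self.window[side], self.index[side], self.key[side]
        if len(window) == self.size[side]:
            old = window.popleft()
            # Within one key, tuples leave in arrival order, so the evicted
            # tuple is the oldest entry of its bucket.
            index[key(old)].popleft()
        window.append(t)
        index[key(t)].append(t)

        other = "R" if side == "L" else "L"
        out = []
        for o in self.index[other].get(key(t), ()):
            left, right = (t, o) if side == "L" else (o, t)
            if self.match(left, right):  # optional extra match predicate
                out.append((left, right))
        if out:
            out.append("PUNCT")  # one punctuation after the batch of matches
        return out


join = HashJoin(2, 2, lambda t: t["ticker"], lambda t: t["ticker"])
first = join.insert("L", {"ticker": "IBM", "vwap": 10})
second = join.insert("R", {"ticker": "IBM", "ask": 9})
```

Here the first insert produces no output because the right window is still empty, and the second produces one joined pair followed by a punctuation.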
Aggregate
Description
The Aggregate operator is used to compute user-specified aggregations over tuples gathered in a window.
Input Ports
The Aggregate operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is WindowBound. The Aggregate operator will process window marker punctuations when configured with a punctuation based window.
Output Ports
The Aggregate operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating. The Aggregate operator will generate a window punctuation after each batch of aggregations it outputs.
Parameters
The Aggregate operator has the following parameters:
groupBy
This is an optional parameter that specifies one or more expressions to be used for dividing the tuples in a window into groups. When a window fires (a sliding window triggers or a tumbling window flushes), one tuple with the user-specified aggregations is computed for each group in the window and these tuples are output as a batch. A window marker punctuation is output after the tuples.
partitionBy
This is an optional parameter that is only valid for an Aggregate operator configured with a partitioned window (see below). It specifies one or more expressions to be used for partitioning the input tuples into sub-windows, where all window and parameter configurations apply to the sub-windows, independently.
aggregateIncompleteWindows
This optional parameter of type boolean is valid only for sliding windows. The default value is false. When set to true, aggregations will be done when a trigger occurs, even if the window has not filled up. If set to false, triggers before the window is full will be ignored.
WindowingThe Aggregate operator supports the following window configurations:
tumbling, (count | delta | time | punctuation)-based eviction(, partitioned (, partitionEvictionSpec)? )?
sliding, (count | delta | time)-based eviction, (count |delta|time)-based trigger (, partitioned (, partitionEvictionSpec)? )?
For the tumbling variants, tuples are aggregated when the window getsfull (and flushes). The tuples containing the aggregates are output at once,followed by a window marker punctuation. Note that more than one tuplecan be output when the groupBy parameter is specified.
For the sliding variants, tuples are aggregated when the window triggers.The tuples containing the aggregates are output at once, followed by awindow marker punctuation. Note that more than one tuple can be outputwhen the groupBy parameter is specified.
The sliding windows for an Aggregate operator do not fire until thewindow is full for the first time unless aggregateIncompleteWindows istrue. This rule does not apply to sliding windows with time-based triggerpolicies. Such windows are assumed to be full when they start out.
For both tumbling and sliding windows, when a time-based window with no tuples in it fires, only a window marker punctuation is output. When a tumbling, punctuation-based window with no tuples in it receives a window marker punctuation, only a window marker punctuation is output.
For the partitioned variants, the window specification and parameters apply to individual sub-windows identified by the partitionBy parameter, as if there were separate Aggregate operators for each partition.
The final punctuation marker does not flush any of the pending windows.
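The idea that each partition behaves like its own Aggregate operator can be sketched in Python. This is a simplified model under stated assumptions: the function `partitioned_tumbling_max` is hypothetical, the window is tumbling with count-based eviction, and the only aggregate is Max.

```python
from collections import defaultdict


def partitioned_tumbling_max(tuples, size):
    """tuples: iterable of (partition_key, value) pairs.
    Each partition key owns an independent tumbling count(size) sub-window."""
    windows = defaultdict(list)
    outputs = []
    for key, value in tuples:
        window = windows[key]
        window.append(value)
        if len(window) == size:  # this sub-window is full: aggregate and flush
            outputs.append((key, max(window)))
            window.clear()
    return outputs


data = [("US", 1), ("CA", 7), ("US", 3), ("CA", 2), ("US", 9)]
print(partitioned_tumbling_max(data, size=2))
```

The last "US" tuple stays in a pending sub-window and produces no output, which mirrors the statement that pending windows are not flushed at the end of the stream.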
Assignments
The Aggregate operator allows aggregated assignments to output attributes. An aggregated assignment has an aggregation function appearing on the right-hand side of the assignment. The following aggregation functions are supported:
• int32 Count(): number of tuples in the group.
• int32 CountGroups(): number of groups in a window.
• int32 CountAll(): number of tuples in the window.
• list<int32> CountByGroup(): list of group sizes (number of tuples in the group) in a window.
• T Any(T v): expression value (v) computed for any tuple in the group (useful for expressions that depend on the groupBy expressions).
• T First(T v): expression value (v) computed for the first (earliest) tuple in the group.
• T Last(T v): expression value (v) computed for the last (latest) tuple in the group.
• list<T> Collect(T v): collection of expression values (v's) computed for the tuples in the group.
• list<T> CollectDistinct(T v): collection of unique expression values (v's) computed for the tuples in the group.
• int32 CountDistinct(T v): number of distinct expression values (v's) computed for the tuples in the group.
• list<int32> CountByDistinct(T v): collection of cardinalities for the distinct expression values (v's) computed for the tuples in the group, where the cardinality is the number of times the distinct value appears. The order of entries in a CountByDistinct result matches the order of entries in a corresponding CollectDistinct result.
• T Average(T v): average of the expression values (v's) computed for the tuples in the group.
• list<T> Average(list<T> v): list of per-element averages of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.
• T Sum(T v): sum of the expression values (v's) computed for the tuples in the group.
• T Sum(T v): same as above, but for strings (concatenation).
• list<T> Sum(list<T> v): list of per-element sums of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.
• T Max(T v): maximum of the expression values (v's) computed for the tuples in the group.
• list<T> Max(list<T> v): list of per-element maximums of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.
• T Min(T v): minimum of the expression values (v's) computed for the tuples in the group.
• list<T> Min(list<T> v): list of per-element minimums of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.
Remember: The Min/Max aggregate functions perform a column-wise min/max on lists. For example, Min([1,2,1], [1,1,2]) == [1,1,1], which is a column-wise comparison. In contrast, the InfoSphere Streams Version 1.2 Min/Max aggregate functions return the smallest/largest list. For example, Min([1,2,1], [1,1,2]) == [1,1,2], which is a lexicographic comparison.
• int32 MaxCount(T v): similar to Max, but returns the number of tuples for which the maximum value occurs, rather than the maximum value itself.
• int32 MinCount(T v): similar to Min, but returns the number of tuples for which the minimum value occurs, rather than the minimum value itself.
• K ArgMin(T v, K w): the argument expression value (w) corresponding to the minimum of the objective expression values (v's) computed for tuples in the group.
• list<K> CollectArgMin(T v, K w): similar to ArgMin, but returns a list in case of more than one argument minimizing the objective.
• K ArgMax(T v, K w): the argument expression value (w) corresponding to the maximum of the objective expression values (v's) computed for tuples in the group.
• list<K> CollectArgMax(T v, K w): similar to ArgMax, but returns a list in case of more than one argument maximizing the objective.
• T SampleStdDev(T v): sample standard deviation of the expression values (v's) computed for the tuples in the group.
• T PopulationStdDev(T v): population standard deviation of the expression values (v's) computed for the tuples in the group.
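Two of the less obvious semantics in the list above, the column-wise Min on lists and the pairing of CountByDistinct with CollectDistinct, can be modeled in a few lines of Python. The helper names are hypothetical, and the first-seen ordering of distinct values is an assumption made for illustration; the text only states that the two results use the same order.

```python
def columnwise_min(lists):
    """Per-element minimum across equally sized lists."""
    if len({len(values) for values in lists}) != 1:
        raise ValueError("all lists must have the same size")
    return [min(column) for column in zip(*lists)]


def collect_distinct(values):
    """Distinct values, here in first-seen order (an assumption)."""
    return list(dict.fromkeys(values))


def count_by_distinct(values):
    """Cardinality of each distinct value, in the same order as
    collect_distinct()."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return list(counts.values())


rows = [[1, 2, 1], [1, 1, 2]]
print(columnwise_min(rows))  # column-wise comparison
print(min(rows))             # Python's built-in min compares lexicographically

states = ["NY", "GA", "NY", "ON", "NY"]
print(collect_distinct(states), count_by_distinct(states))
```

The two `print` calls on `rows` reproduce the contrast between the column-wise result and the lexicographic result attributed to Version 1.2.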
Output attributes missing assignments are automatically forwarded fromthe input ones using the Last aggregate.
Metrics
The Aggregate operator has the following metrics:
• nCurrentPartitions: The number of partitions currently in the window for the Aggregate operator.
composite Main {
  graph
    stream Beat = Beacon() {}

    // tumbling window with no group by
    stream Agg0 = Aggregate(Beat)
    {
      window
        Beat : tumbling, time(10.5);
      output
        Agg0 : maxSalary = Max(salary), ageOfMaxSalary = ArgMax(salary, age);
    }

    // tumbling window with group by
    stream Agg1 = Aggregate(Beat)
    {
      window
        Beat : tumbling, punct();
      param
        groupBy : country, city;
      output
        Agg1 : maxSalary = Max(salary);
    }

    // tumbling partitioned window with no group by
    stream Agg2 = Aggregate(Beat)
    {
      window
        Beat : tumbling, delta(id, 10lu), partitioned;
      param
        partitionBy : country, city;
      output
        Agg2 : maxSalary = Max(salary),
               numPeopleWithMaxSalary = MaxCount(salary);
    }

    // tumbling partitioned window with group by
    stream Agg3 = Aggregate(Beat)
    {
      window
        Beat : tumbling, count(10), partitioned;
      param
        groupBy : city;
        partitionBy : country;
      output
        Agg3 : maxSalary = Max(salary),
               peopleWithMaxSalary = CollectArgMax(salary, name);
    }

    // sliding window with no group by
    stream Agg4 = Aggregate(Beat)
    {
      window
        Beat : sliding, time(10.5), count(10);
      output
        Agg4 : maxSalary = Max(salary), ageOfMaxSalary = ArgMax(salary, age);
    }

    // sliding window with group by
    stream Agg5 = Aggregate(Beat)
    {
      window
        Beat : sliding, count(10), count(1);
      param
        groupBy : country, city;
      output
        Agg5 : maxSalary = Max(salary);
    }

    // sliding partitioned window with no group by
    stream Agg6 = Aggregate(Beat)
    {
      window
        Beat : sliding, delta(id, 10lu), count(10), partitioned;
      param
        partitionBy : country, city;
      output
        Agg6 : maxSalary = Max(salary),
               numPeopleWithMaxSalary = MaxCount(salary);
    }

    // sliding partitioned window with group by
    stream Agg7 = Aggregate(Beat)
    {
      window
        Beat : sliding, count(10), time(1), partitioned;
      param
        groupBy : city;
        partitionBy : country;
      output
        Agg7 : maxSalary = Max(salary),
               peopleWithMaxSalary = CollectArgMax(salary, name);
    }
}
Chapter 2. Adapter Operators
FileSource
Description
The FileSource operator reads data from a file and produces tuples as a result.
Input Ports
The FileSource operator has one optional input port. If present, the input port schema must be a tuple with a single rstring attribute. Each input tuple holds the name of a file to be read by the FileSource operator. While the tuple is processed, the entire file is read and the FileSource operator generates the corresponding tuples.
Output Ports
The FileSource operator is configurable with two output ports. The first output port is mutating and its punctuation mode is Generating. The FileSource operator outputs a window marker punctuation when the file has been read in full.
The second output port is optional and must carry a tuple with two attributes: one of type rstring and one of type int32. This stream generates a tuple with the file name and 0 as the attribute values when the end of the file being read is reached. If a file fails to open, the stream generates a tuple with the file name and the system error code. This allows a downstream operator to know which files were processed, and which files could not be opened successfully.
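The (file name, error code) pairs carried by the second output port can be pictured with the following Python sketch. It is an analogy only: `open_status` is a hypothetical helper, and it assumes that the "system error code" corresponds to the C errno value, which `os.strerror` turns into a readable message.

```python
import errno
import os


def open_status(path):
    """Return (file name, code): 0 once the file has been read to the end,
    or the system error number if the file cannot be opened."""
    try:
        with open(path, "rb") as handle:
            handle.read()  # read the whole file before reporting success
        return (path, 0)
    except OSError as exc:
        return (path, exc.errno)


name, code = open_status("/no-such-directory/file.0")
if code != 0:
    # Convert the numeric code to its string form for logging
    print(name, code, os.strerror(code))
```

A downstream consumer can treat a code of 0 as "processed" and any other value as "could not be opened".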
Parameters
The FileSource operator has the following parameters:
file
This is an optional parameter that specifies the name of the source file. It must not be present if the FileSource operator has an input port; otherwise it must be present. It is of type rstring. The file parameter may refer to a named pipe, unless the hotFile parameter is set to true: hotFile is implemented using seek, and seek is not valid on a named pipe.
format
This optional parameter specifies the format of the file. Valid values are txt, csv, bin, line, and block. The default format is csv. This parameter can only take a single value. The individual format options are as follows:
• txt: This format expects the file to be structured as a series of lines, where each line is a tuple literal, free of any type suffixes. String literals must be in double quotes. The # character can be used to mark comment lines. An example is as follows:
# tuple
{name="John", age=40}
{name="Mary", age=35}
• csv: This format expects the file to be structured as a series of lines, where each line is a list of comma-separated values. String literals that are used at the outermost level can appear without the double quotes, unless they contain a ',' character or escaped characters, in which case double quotes are required. Both rstring and ustring values should appear as UTF-8 encoded strings. For fields missing in the csv formatted line (as in , ,), default constructed values are used, unless the defaultTuple parameter is specified. The separator parameter may be used to change the default separator of ','. The '.' character is used as the decimal point for binary and decimal floating point data. The # character can be used to mark comment lines. An example is as follows:
# tuple
John, 40, [{city="New York City",state="NY"},{city="Atlanta",state="GA"}]
"Mary, and co.", 35, [{city="Toronto",state="ON"},{city="White Plains",state="NY"}]
• bin: This format expects the file to be structured as a series of tuples in binary, using network byte order. Tuple attributes are assumed to be serialized in sequence to form a tuple.
• line: This format expects the file to be structured as a series of lines. It also expects the output stream schema to contain a single attribute of type rstring. Each line is converted into a tuple, where the line text (excluding the end of line marker) becomes the rstring attribute in the output tuple. The end of line marker can be customized with the eolMarker parameter.
• block: This format expects the file to be structured as a series of binary blocks. It also expects the output stream schema to contain a single attribute of type blob. Each block is converted into a tuple. The block size can be customized with the blockSize parameter. The last block read from the file may be less than blockSize bytes.
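The line and block formats are simple enough to model directly. The Python sketch below is illustrative only: `split_lines` and `split_blocks` are hypothetical names, and treating a trailing end of line marker as not starting another (empty) line is an assumption.

```python
def split_lines(text, eol_marker="\n"):
    """Model of the line format: one string per line, marker removed."""
    parts = text.split(eol_marker)
    if parts and parts[-1] == "":
        parts.pop()  # assumption: a trailing marker does not add an empty line
    return parts


def split_blocks(data, block_size):
    """Model of the block format: fixed-size blobs; the last one may be
    shorter than block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


print(split_lines("John\r\nMary\r\n", eol_marker="\r\n"))
print(split_blocks(b"abcdefg", 3))
```

In the second call the final block holds a single byte, matching the note that the last block may be smaller than blockSize.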
hasHeaderLine
This optional attribute-free parameter of type boolean or uint32 is valid only if the format is csv. If true, the first line in the file is read and ignored. If false (the default), no lines are skipped. If a uint32 expression is passed, that number of lines is skipped. This allows column names to be present in the first several lines of the file.
ignoreOpenErrors
This optional parameter of type boolean specifies whether the FileSource operator continues executing if the input file cannot be opened. If the ignoreOpenErrors parameter is set to true and an input file cannot be opened, the FileSource operator logs an error and proceeds with the next input file. If the parameter is not present, or is set to false, the FileSource operator logs an error and terminates. By default, the ignoreOpenErrors parameter is set to false.
hasDelayField
This optional parameter of type boolean is used to instruct the FileSource operator to expect an additional attribute which specifies a delay to be used to pace the generation of the output tuples. By default, it is false. This parameter can only be used with txt, csv, and bin formats. The type of the delay attribute must be float64 and it is assumed to be in seconds. The delay attribute must appear before the tuple. In the case of txt and csv formats, the delay attribute is separated from the tuple by a single comma with optional spaces before and after it. For example, for txt format:
# tuple
1.50, {name="John", age=40}
1.75, {name="Mary", age=35}
And for csv format:
# tuple
1.50, John, 40
1.75, Mary, 35
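As a rough illustration of how the leading delay field relates to the rest of the line, the following Python sketch splits a txt or csv line at the first comma. The function `split_delay` is hypothetical and ignores comment lines and quoting; it is not the operator's parser.

```python
def split_delay(line):
    """Split '<delay>, <tuple text>' into (delay in seconds, tuple text)."""
    delay_text, rest = line.split(",", 1)  # only the first comma separates the delay
    return float(delay_text), rest.strip()


print(split_delay("1.50, John, 40"))
print(split_delay('1.75, {name="Mary", age=35}'))
```

Because only the first comma is consumed, commas inside the tuple itself are left untouched.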
defaultTuple
This optional parameter can be specified to indicate the attribute values to be used in case of missing values in the source data. It is only valid for the csv format. It can take a single value of tuple type. This type must match the type of the output port tuples.
parsing
This optional parameter can be specified to customize the parsing behavior of the FileSource operator. There are three valid values: strict, permissive, and fast. When strict is specified, incorrectly formatted tuples result in a runtime error and termination of the operator. When permissive is specified, incorrectly formatted tuples result in a runtime log entry, and the parser makes an effort to skip to the next tuple (formats txt and csv) and continue. If format is bin, the parser closes the current file and starts reading the next file (if FileSource has an input stream). permissive can only be used with txt, csv, and bin formats. When fast is specified, the input file is assumed to be formatted correctly, and no runtime checks are performed. Incorrect input in fast mode causes undefined behavior. The default parsing mode is strict.
compression
This optional parameter is used to specify that the source file is compressed. There are three valid values, representing the available compression algorithms: zlib, gzip, and bzip2.
encoding
This optional rstring parameter can be used to specify the character set encoding used in the input file. The contents of the file are converted to the UTF-8 character set from the given character set after any decompression and before extraction of the tuples is performed. An example of a valid character set encoding is ISO_8859-9. A list of available encodings can be retrieved using the iconv --list command. encoding is not valid with formats bin or block.
hotFile
This optional parameter of type boolean is used to specify whether the input file is hot. As opposed to regular files, hot files are not closed when the end of the file is reached for the first time. Instead, the file is continuously checked for more data. If the file size shrinks during these checks, the file offset is reset to the beginning of the file. The default value for the hotFile parameter is false. When set to true, a final marker is not sent upon reaching the end of the file, as hot files ignore that event. Instead, a final marker is sent upon shutdown, after a window marker punctuation is sent. Additionally, if the file offset is ever reset, a window marker punctuation is sent. The hotFile parameter may not be specified if the FileSource operator has an input port, or if deleteFile or moveFileToDirectory are specified.
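The offset handling described for hot files can be summarized in a tiny Python model. The function `next_read` is hypothetical and only captures the decision made at each check: continue from the current offset, or start over when the file has shrunk.

```python
def next_read(offset, file_size):
    """Return (start, end, reset) for the next check of a hot file.
    start/end is the byte range to read; reset tells whether the offset
    went back to the beginning because the file shrank."""
    reset = file_size < offset
    start = 0 if reset else offset
    return start, file_size, reset


print(next_read(100, 150))  # file grew: read the new bytes only
print(next_read(100, 40))   # file shrank: start again from the beginning
```

In the second case the text above says a window marker punctuation is also sent; the model only reports that a reset happened.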
deleteFile
This optional parameter of type boolean is used to specify that the file should be removed after processing of the file is finished. The deleteFile parameter cannot be specified if hotFile or moveFileToDirectory is specified.
moveFileToDirectory
This parameter of type rstring is used to specify that the file should be moved to the named directory after processing of the file is finished. Any file of the same name in the moveFileToDirectory directory is removed before the move is done. The moveFileToDirectory parameter cannot be specified if hotFile or deleteFile is specified.
A .rename subdirectory may be created in the target directory if the target directory is on a different filesystem. This is used to ensure that the files appear atomically at the target directory.
eolMarker
This optional parameter is used to specify the end of line marker. It is of type rstring. It can only be used when the line format is specified. It defaults to "\n". Valid values include strings with one or two characters, such as "\r" and "\r\n".
initDelay
This optional float64 parameter is used to specify the number of seconds that the FileSource operator is to delay before starting to produce tuples. If the FileSource operator has an input stream, the delay happens on receipt of the first tuple. During the delay, the operator is blocked, and any further input tuples block as well.
blockSize
This parameter is used to specify the block size. It is of type uint32. It is mandatory when the block format is specified and cannot appear otherwise.
separator
This optional rstring parameter is used to specify an alternate separator character for csv format. It must be a single-character string constant. separator may only be specified if the format is csv.
ignoreExtraCSVValues
This optional parameter of type boolean is only relevant with format : csv. If true, extra data on the current input line after the last attribute read is skipped. If not present, or if ignoreExtraCSVValues has the value false, extra data on a line in csv format causes an error to be logged (parsing : permissive) or an exception to be raised (parsing : strict).
Windowing
The FileSource operator does not accept any window configurations.
Assignments
The FileSource operator does not allow assignments to output attributes.
Metrics
The FileSource operator has the following metrics:
• nFilesOpened: The number of files opened by FileSource. Only interesting if the FileSource operator has an input port.
• nInvalidTuples: The number of tuples that failed to read correctly in csv or txt format.
Exceptions
The FileSource operator will throw an exception and terminate in the following cases:
• The input file cannot be opened for reading.
• The moveFileToDirectory directory does not exist.
• The moveFileToDirectory value is not a directory.
composite Main {
  graph
    // source operator with a relative file argument
    stream Beat = FileSource()
    {
      param
        file : "People.dat"; // looks for /data/People.dat
    }

    // source operator with a default tuple for missing arguments
    stream Beat1 = FileSource()
    {
      param
        file : "People.dat";
        defaultTuple : {name="foo", age=19u, salary=10000ul};
    }

    // source operator with an absolute file argument and hot file option
    stream Beat2 = FileSource()
    {
      param
        file : "/tmp/People.dat";
        hotFile : true;
    }

    // source operator with a csv format specifier,
    // hasDelayField option, and custom separator
    stream Beat3 = FileSource()
    {
      param
        file : "People.dat";
        format : csv;
        separator : "|";
        hasDelayField : true;
    }

    // source operator with a txt format specifier and compression
    stream Beat4 = FileSource()
    {
      param
        file : "People.dat";
        format : txt;
        compression : zlib;
    }

    // source operator with a csv format specifier and with strict parsing, waiting
    // 5 seconds before starting to process the file
    stream Beat5 = FileSource()
    {
      param
        file : "People.dat";
        format : csv;
        parsing : strict;
        initDelay : 5.0;
    }

    // source operator with a bin format specifier
    stream Beat6 = FileSource()
    {
      param
        file : "People.dat";
        format : bin;
    }

    // source operator with a line format specifier
    stream Beat7 = FileSource()
    {
      param
        file : "People.dat";
        format : line;
    }

    // source operator with a line format specifier, and an eolMarker option
    stream Beat8 = FileSource()
    {
      param
        file : "People.dat";
        format : line;
        eolMarker : "\r";
    }

    // source operator with a block format specifier
    stream Beat9 = FileSource()
    {
      param
        file : "People.dat";
        format : block;
        blockSize : 1024u;
    }

    stream Files = DirectoryScan() {
      param directory: "foo";
    }

    // source operator reading tuples of 2 int32s from files in directory foo
    // Delete the files after processing is done
    stream Beat10 = FileSource(Files)
    {
      param deleteFile : true;
    }
}
The following example uses the second output stream, and shows how to get the string form of the reason for failure:
composite Main() {
  graph
    stream A = Beacon () {
      logic state : mutable int32 i = 0;
      param iterations : 4;
      output A : f = "file." + (rstring)i++;
    }

    (stream B; stream C) = FileSource (A) {
      param ignoreOpenErrors: true;
    }

    stream D = Functor (C) {
      output D : reason = strerror (e);
    }

    () as Nil = FileSink (D) {
      param file : "out";
    }
}
FileSink
Description
The FileSink operator writes tuples to a file.
Input Ports
The FileSink operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.
Output Ports
The FileSink operator is configurable with an optional output stream, which carries the name of the file that was just closed. If the file is moved, the destination file name is the one submitted on the output stream.
Parameters
The FileSink operator has the following parameters:
file
This is a mandatory parameter that specifies the name of the output file. See the corresponding parameter in the FileSource operator for details. Only the last component of the pathname is created if it does not already exist. All directories in the pathname up to the last component must already exist. For example, in file : "/a/b/c", /a and /a/b must already exist and be directories. The file is created as an empty file, discarding any previous contents. The user id and the umask of the instance owner are used. The tuples written to the file are flushed to disk according to the flush and flushOnPunctuation parameters.
append
This optional boolean parameter is used to specify that the generated tuples are appended to the output file. If false, or not specified, the output file is truncated before the tuples are generated.
format
See the corresponding parameter in the FileSource operator for details.
hasDelayField
This optional parameter of type boolean is used to output an additional attribute per tuple, which specifies the inter-arrival delays between the input tuples. See the corresponding parameter in the FileSource operator for details.
compression
See the corresponding parameter in the FileSource operator.
encoding
This optional rstring parameter can be used to specify the character set encoding used in the output file. Data written to the output file is converted from the UTF-8 character set to the given character set before any compression is performed. encoding is not valid with formats bin or block.
eolMarker
See the corresponding parameter in the FileSource operator.
flush
This optional parameter of type uint32 is used to flush the output file after a given number of tuples. By default, no flushing based on tuple count is performed.
Note: If an application expects low volumes of data, use the flush parameter to ensure that the output file is written to disk.
flushOnPunctuation
This optional parameter of type boolean is used to flush the output file when punctuation is received. flushOnPunctuation defaults to true.
writePunctuations
This optional parameter of type boolean is used to write punctuations to the output file. It is false by default. writePunctuations can only be used with txt and csv formats.
separator
See the corresponding parameter in the FileSource operator.
quoteStrings
This optional parameter of type boolean is used to control the quoting of top-level rstrings. It is true by default. If true, rstrings in the tuple are generated with a leading and trailing double quote ("), and control characters are escaped. If false, rstrings in the tuple are written as is. quoteStrings can only be used with the csv format.
closeMode
This is an optional parameter of type enum {punct, count, size, time, never}. The default value is never. For any other value, when the specified condition is satisfied, the current output file is closed and a new file is opened for writing. In such cases, the file parameter must contain one or more {id} fields to indicate the parts that will be updated with the file id. For example, in the file name "myfile{id}.dat", each {id} is replaced by 0 for the first file, 1 for the next file that is opened, and so on.
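The {id} substitution can be modeled in Python as a plain string replacement. The helpers below are hypothetical; the second one additionally assumes count-based rotation, where a new file is started after a fixed number of tuples (the role of the tuplesPerFile parameter).

```python
def output_file_name(pattern, file_id):
    """Replace every {id} field in the file name with the file id."""
    if "{id}" not in pattern:
        raise ValueError("file must contain at least one {id} field")
    return pattern.replace("{id}", str(file_id))


def file_for_tuple(pattern, tuple_index, tuples_per_file):
    """Name of the file that receives the tuple with the given 0-based index
    when a new file is opened every tuples_per_file tuples."""
    return output_file_name(pattern, tuple_index // tuples_per_file)


print([file_for_tuple("myfile{id}.dat", i, 2) for i in range(5)])
```

With two tuples per file, the five tuples land in files with ids 0, 0, 1, 1, and 2.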
tuplesPerFile
This parameter is used to specify the maximum number of tuples that can be received for each output file. When the specified number of tuples has been received, the current output file is closed and a new file is opened for writing. This parameter is of type uint64 or uint32 and must be specified if the closeMode parameter is set to count.
timePerFile
This parameter of type float64 is used to specify the approximate time, in seconds, after which the current output file is closed and a new file is opened. This parameter must be specified if the closeMode parameter is set to time.
bytesPerFile
This parameter is used to specify the approximate size of the output file, in bytes. When the file size exceeds the specified number of bytes, the current output file is closed and a new file is opened. This parameter is of type uint64 or uint32 and must be specified when the closeMode parameter is set to size.
moveFileToDirectory
This optional parameter of type rstring is used to specify that the file should be moved to the named directory after the file is closed. Any existing file with the same name is removed before the file is moved to the moveFileToDirectory directory.
A .rename subdirectory may be created in the target directory if the target directory is on a different filesystem. This is used to ensure that the files appear atomically at the target directory.
Windowing
The FileSink operator does not accept any window configurations.
Assignments
The FileSink operator does not allow assignments to output attributes.
Exceptions
The FileSink operator will throw an exception and terminate the operator in the following case:
• The output file cannot be opened for writing.
composite Main {
  graph
    stream Beat = Beacon() {}

    // sink operator with the hasDelayField option, and fields separated by ":"
    // rstrings will not be printed with double quotes
    () as Sink1 = FileSink(Beat)
    {
      param
        file : "/tmp/People.dat";
        format : csv;
        separator : ":";
        hasDelayField : true;
        quoteStrings : false;
    }

    // sink operator with a txt format specifier and compression
    () as Sink2 = FileSink(Beat)
    {
      param
        file : "People.dat";
        format : txt;
        compression : zlib;
    }

    // sink operator with a bin format specifier and flush option
    () as Sink3 = FileSink(Beat)
    {
      param
        file : "People.dat";
        format : bin;
        flush : 1u;
    }

    // sink operator with a writePunctuations option and no flushing on punctuation
    () as Sink4 = FileSink(Beat)
    {
      param
        file : "People.dat";
        writePunctuations : true;
        flushOnPunctuation : false;
    }
}
DirectoryScan
Description
The DirectoryScan operator watches a directory, and generates file names on the output, one for each file that is found in the directory. The absolute pathname of the file is generated. A file name is generated only the first time the file is seen during a directory scan, and again only if the file is recreated. The change time (ctime) is used to detect if a file has been recreated. The output clause and custom output functions can be used to specify additional information about a file. All non-regular files found in the directory are ignored during the scan.
Note: Because the change time of the file is used to detect if a file has been recreated, it is possible that very large files are still being written when a directory is being scanned. In this case, the same file name may be generated multiple times, if the time between scans is less than the time to write the file. To avoid this, the file should be written into a different directory on the same filesystem as the directory being scanned, and then renamed to the target directory when complete (/bin/mv will do this if the files are on the same filesystem). If a regular expression pattern is being used to match only certain files, creating the new files under a name that fails to match the pattern, and then renaming, will also work.
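The write-then-rename approach recommended in the note can be sketched from the producer's side in Python. The function `publish` and the directory arguments are hypothetical; the sketch assumes both directories live on the same filesystem, where `os.rename` makes the finished file appear in a single step.

```python
import os


def publish(data, staging_dir, scanned_dir, name):
    """Write the file in full under staging_dir, then rename it into the
    directory that is being scanned."""
    staging_path = os.path.join(staging_dir, name)
    with open(staging_path, "wb") as handle:
        handle.write(data)  # the scanner never sees this partial file
    final_path = os.path.join(scanned_dir, name)
    # Atomic only when both directories are on the same filesystem
    os.rename(staging_path, final_path)
    return final_path
```

A call such as `publish(b"...", "/data/staging", "/data/incoming", "People.dat")` (paths chosen purely for illustration) would leave nothing in the staging directory once the file is visible to the scan.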
Before submitting the file name to the output stream, the DirectoryScan operator can optionally move processed files to a different directory using the moveToDirectory parameter. If the moveToDirectory parameter is specified, the file (or symbolic link) is moved to the moveToDirectory directory before the output tuple is generated.
When moveToDirectory is specified, it is valid to have multiple DirectoryScan operators reading the same directory. The DirectoryScan operator ensures that each file is submitted by only one operator by creating a temporary .rename subdirectory in the directory and moveToDirectory directories.
Input Ports
The DirectoryScan operator does not have any input ports.
Output Ports
The DirectoryScan operator is configurable with a single output port. The output port is non-mutating and its punctuation mode is Free. The output schema for the DirectoryScan operator is a tuple. The generated tuple is populated using the output clause. If there is no output clause, or an attribute in the tuple is not assigned using an output clause, then the attribute must be of type rstring.
Parameters
The DirectoryScan operator has the following parameters:
directory
This is a mandatory parameter that specifies the name of the directory to be scanned. It is of type rstring.
moveToDirectory
This optional parameter of type rstring specifies the name of the directory to which files should be moved before the output tuple is generated.
pattern
This optional parameter of type rstring is used to instruct the DirectoryScan operator to ignore file names that do not match the regular expression pattern.
sortBy
This optional parameter determines the order in which file names are generated during a single scan of the directory when there are multiple valid files at the same time. The valid values are date and name. If the sortBy parameter is not specified, the default sort order is date.
order
This optional parameter controls how the sortBy parameter sorts the files. The valid values are ascending and descending. If the order parameter is not specified, the default value is ascending.
If sortBy is set to date, the file with the oldest change time (ctime) is generated first for ascending order. If sortBy is set to name, the file name that is lexically smallest is generated first for ascending order.
sleepTime
This optional parameter of type float64 specifies the minimum time, in seconds, between scans of the directory. If not specified, the default is 5.0 seconds. If the time difference between the start of the last scan and the current time is less than sleepTime seconds, the DirectoryScan operator will sleep until the time since the last scan is sleepTime seconds. If more than sleepTime seconds have already passed, the next scan will begin immediately.
initDelay
This optional float64 parameter is used to specify the number of seconds that the DirectoryScan operator is to delay before starting to produce tuples.
ignoreDotFiles
This optional boolean parameter determines if the DirectoryScan operator ignores files with a leading period (.) in the directory. By default, the value is set to false and files with a leading period are processed.
ignoreExistingFilesAtStartup
This optional boolean parameter determines if the DirectoryScan operator ignores pre-existing files in the directory. By default, the value is set to false and all files are processed as usual. If set to true, any files present in the directory are marked as already processed, and not submitted. If initDelay is specified, this check is done before the DirectoryScan operator delays.
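The optional parameters above can be combined as in the following sketch. It is an illustration only: the stream names, the attribute name, the directory, and the chosen values are assumptions, not recommendations.

```spl
composite TunedScanSketch {
  graph
    // Within each scan, report the lexically largest file name first.
    stream<rstring name> LargestNameFirst = DirectoryScan()
    {
      param
        directory : "/tmp/work";
        sortBy    : name;
        order     : descending;
    }

    // Scan more often than the 5.0 second default and report only
    // files that arrive after the operator has started.
    stream<rstring name> FreshFiles = DirectoryScan()
    {
      param
        directory                    : "/tmp/work";
        sleepTime                    : 1.0;   // at least 1 second between scan starts
        initDelay                    : 5.0;   // wait 5 seconds before producing tuples
        ignoreDotFiles               : true;  // skip names such as ".hidden"
        ignoreExistingFilesAtStartup : true;  // pre-existing files are marked as processed
    }
}
```

With sleepTime set to 1.0, a scan that itself takes longer than one second is followed immediately by the next scan, as described for the sleepTime parameter.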
Assignments
The DirectoryScan operator supports the following custom output functions:
• rstring FilePath(): The pathname to the file in the directory, relative to the input directory parameter.
• rstring FileName(): The last component of the pathname.
• rstring FullPath(): The absolute pathname to the file in the directory.
• rstring DestinationFullPath(): The absolute pathname to the file in the destination directory.
• rstring Directory(): The value of the directory parameter.
• rstring DestinationDirectory(): The value of the moveToDirectory parameter, or the directory parameter if moveToDirectory is not specified.
• rstring DestinationFilePath(): The pathname to the file in the destination directory.
• uint64 Size(): The size of the file in bytes.
• uint64 Atime(): The access time (atime) of the file in seconds since the epoch.
• uint64 Ctime(): The change time (ctime) of the file in seconds since the epoch.
• uint64 Mtime(): The modification time (mtime) of the file in seconds since the epoch.
Note: The atime, ctime, and mtime fields are set from the original file in the source directory.
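As a sketch of how these output functions are assigned, the following invocation fills four attributes. The stream name, the attribute names, and the directory are hypothetical; only the function names and their return types come from the list above.

```spl
composite FileInfoSketch {
  graph
    stream<rstring file, rstring path, uint64 bytes, uint64 modified> FileInfo = DirectoryScan()
    {
      param
        directory : "/tmp/work";
      output FileInfo :
        file     = FileName(),   // last component of the pathname
        path     = FullPath(),   // absolute pathname in the scanned directory
        bytes    = Size(),       // file size in bytes
        modified = Mtime();      // modification time, seconds since the epoch
    }
}
```

Every attribute is assigned explicitly here, so the rule that unassigned attributes must be of type rstring does not come into play.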
Metrics
The DirectoryScan operator has the following metrics:
• nScans: The number of times the DirectoryScan operator has read the directory.
Exceptions
The DirectoryScan operator will throw an exception and terminate in the following cases:
• The directory or moveToDirectory does not exist.
• The directory or moveToDirectory is not a directory.
• The pattern is not a valid regular expression.
• The .rename directories cannot be created when moveToDirectory is specified.
composite Main {
  graph
    // The schemas <rstring name> and <rstring line> are assumed for
    // illustration; the attribute names are not significant.

    // DirectoryScan operator with a relative directory argument
    stream<rstring name> Dir1 = DirectoryScan()
    {
      param
        directory : "People.dir";
        initDelay : 10.0;
    }

    // DirectoryScan operator with an absolute directory argument and a file name pattern
    stream<rstring name> Dir2 = DirectoryScan()
    {
      param
        directory : "/tmp/work";
        pattern   : "^work.*";
    }

    // use a FileSource operator to process the file names
    stream<rstring line> Beat6 = FileSource(Dir2)
    {
      param // note: param file is not specified
        format     : line;
        deleteFile : true; // delete the file when processing is finished
    }

    // Use DirectoryScan operator to move files to a different directory.
    // Move the scanned files to the /tmp/active directory. Generate a tuple containing
    // the original filename in /tmp/work (sourceFile), and the moved filename
    // in /tmp/active (movedFile).
    // Generate the size of the file (fileSize).
    stream<rstring sourceFile, rstring movedFile, uint64 fileSize> Dir3 = DirectoryScan()
    {
      param
        directory       : "/tmp/work";
        moveToDirectory : "/tmp/active";
      output Dir3 : sourceFile = FilePath(), movedFile = DestinationFilePath(),
                    fileSize = Size();
    }
}
TCPSource
Description
The TCPSource operator reads data from a TCP socket and creates tuples out of it. It can be configured as a TCP server (listens for a client connection) or as a TCP client (initiates a connection to a server). In both modes it handles a single connection at a time. It works with both IPv4 and IPv6 addresses.
Input Ports
The TCPSource operator does not have any input ports.
Output Ports
The TCPSource operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating. The TCPSource operator will output a window marker punctuation when a TCP connection terminates.
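One way a downstream operator might react to the window marker is sketched below. This is an assumption-laden illustration: it relies on the Custom operator with an onPunct clause and on the currentPunct() and printStringLn() functions being available in this release, and the stream names, the port number, and the line format are chosen only for the example.

```spl
composite ConnectionLogSketch {
  graph
    // Server that reads one line of text per tuple (assumed format and port).
    stream<rstring line> Lines = TCPSource()
    {
      param
        role   : server;
        port   : 23145u;
        format : line;
    }

    // Print each line, and note when a connection has ended.
    () as ConnectionLog = Custom(Lines)
    {
      logic
        onTuple Lines :
        {
          printStringLn(Lines.line);
        }
        onPunct Lines :
        {
          // A window marker arrives when the TCP connection terminates.
          if (currentPunct() == Sys.WindowMarker)
          {
            printStringLn("connection closed");
          }
        }
    }
}
```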
Parameters
The TCPSource operator has the following parameters:
role
This mandatory parameter specifies whether the TCPSource operator is server-based or client-based. It takes one of the following two values: server and client.
address
In the case of a client-based TCPSource operator, this parameter specifies the destination server address of the TCP connection. The address parameter must be specified when the role parameter is set to client and the name parameter is not specified. In all other cases, it cannot be specified. It takes a single value of type rstring. This value could be a host name or an IP address. address may not be used for a server-based TCPSource operator, as the address used is always on the current host.
port
In the case of a server-based TCPSource operator, this parameter specifies the port address on which the connections will be accepted. In the case of a client-based TCPSource operator, it specifies the destination server port address. It takes a single value of type rstring or type uint32. This could be a well-known port alias, such as "http" or "ftp" (see footnote 1), as well as a plain port number, such as 45134u. It is an optional parameter for server-based TCPSource operators and when omitted its default value is 0, which picks any available port. For client-based TCPSource operators, the port parameter must be specified when the name parameter is not specified, and it cannot be specified otherwise.
name
In the case of a server-based TCPSource operator, this parameter specifies the name to be used to register the address and port pair for the server with the name service that is part of the Streams runtime. This name can be used by a corresponding client-based TCPSink operator to connect to this operator by just specifying the name. These names are automatically prefixed by the application scope, thus applications with differing scopes cannot communicate through the same name. The application scope can be set through the use of config applicationScope on the main composite in the application. It is an error for a name with the same application scope to be defined multiple times within an instance. If multiple operators attempt to define the same name, the second and subsequent operators will keep trying periodically to register the name, with an error message for each failure. In the case of a client-based TCPSource, this parameter specifies the name to be used to look up the address and port pair for the destination server from the name service that is part of the Streams runtime. It is an optional parameter that takes a single value of type rstring. The streamtool getnsentry command can be used to query server-based TCPSource addresses. The Value field will contain host:port. When the name parameter is specified in the client mode, the port and address parameters cannot be specified.
parsing
This optional parameter can be specified to customize the parsing behavior of the TCPSource operator. There are three valid values, namely: strict, permissive, and fast. When strict is specified, incorrectly formatted tuples will result in a runtime error and termination of the operator. When permissive is specified, incorrectly formatted tuples will result in a runtime log entry being created, and the parser will make an effort to skip to the next tuple (formats txt and csv) and continue. If format is bin, the parser will close the current connection, and start reading the next connection (if the reconnectionPolicy permits). permissive can only be used with txt, csv, and bin formats. When fast is specified, the input is assumed to be formatted correctly, and no runtime checks will be performed. Incorrect input in fast mode causes undefined behavior. The default parsing mode is strict.
1. As specified under /etc/services
interface
This optional rstring parameter specifies the network interface to use to register when the name parameter is specified. interface is only valid when role is server and when name is specified. Using interface with name will ensure that a matching operator with a role of client and the same name parameter will use the desired interface.
receiveBufferSize
This is an optional parameter that is used to override the default kernel receive buffer size. It is of type uint32.
reconnectionPolicy
This is an optional parameter that specifies the reconnection policy. In the case of a server-based TCPSource operator, this parameter specifies if additional connections are allowed once the initial connection terminates. In the case of a client-based TCPSource operator, this parameter specifies if additional connection attempts will be made once the initial connection to the server terminates. The valid values are: NoRetry, InfiniteRetry, and BoundedRetry. If not specified, it is set to InfiniteRetry. When set to NoRetry, the TCPSource operator produces a final marker punctuation immediately after the initial connection is terminated and a window marker punctuation is sent.
reconnectionBound
This parameter specifies the number of successive connections that will be attempted for a client-based TCPSource operator or accepted for a server-based TCPSource operator. It is an optional parameter of type uint32. It must appear when the reconnectionPolicy parameter is set to BoundedRetry and cannot appear otherwise.
format
See the corresponding parameter in the FileSource operator for details.
defaultTuple
See the corresponding parameter in the FileSource operator for details.
hasDelayField
See the corresponding parameter in the FileSource operator for details.
compression
See the corresponding parameter in the FileSource operator for details.
encoding
See the corresponding parameter in the FileSource operator for details.
eolMarker
See the corresponding parameter in the FileSource operator for details.
blockSize
See the corresponding parameter in the FileSource operator for details.
initDelay
See the corresponding parameter in the FileSource operator for details.
separator
See the corresponding parameter in the FileSource operator for details.
ignoreExtraCSVValues
See the corresponding parameter in the FileSource operator for details.
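The name-based connection described for the name parameter can be sketched as a pair of applications. Everything specific here is an assumption for illustration: the composite and stream names, the scope "demoScope", the line format, and the use of a Beacon operator to generate test tuples. The two composites are assumed to be compiled as separate applications and submitted to the same instance.

```spl
// Server application: registers its host:port pair under "my_server".
composite NamedServerSketch {
  graph
    stream<rstring line> FromClient = TCPSource()
    {
      param
        role   : server;
        name   : "my_server";   // port is omitted, so any available port is used
        format : line;
    }
  config
    applicationScope : "demoScope";   // hypothetical scope, shared with the client
}

// Client application: looks up "my_server" instead of giving address and port.
composite NamedClientSketch {
  graph
    stream<rstring line> Ticks = Beacon()
    {
      param
        iterations : 10u;
      output Ticks : line = "hello";
    }
    () as ToServer = TCPSink(Ticks)
    {
      param
        role   : client;
        name   : "my_server";
        format : line;
    }
  config
    applicationScope : "demoScope";   // must match the server's scope
}
```

If the two applications used different application scopes, the client would not find the registered name, because names are prefixed by the scope.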
Assignments
The TCPSource operator does not allow assignments to output attributes.
Metrics
The TCPSource operator has the following metrics:
• nReconnections: The number of times the TCPSource operator lost connection and reconnected to the other end of the TCP socket.
• nInvalidTuples: The number of tuples that failed to read correctly in csv or txt format.
• nConnections: The number of currently active TCP/IP connections. The value is 0 if the TCPSource operator is waiting for a connection or a reconnection, or 1 if the operator is currently connected.
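A sketch of a server that tolerates malformed input, which is the situation the nInvalidTuples metric counts, might look like this. The stream name, the schema, and the port number are assumptions for illustration.

```spl
composite PermissiveParseSketch {
  graph
    stream<rstring name, uint32 age> People = TCPSource()
    {
      param
        role    : server;
        port    : 23145u;
        format  : csv;
        parsing : permissive;  // log and skip malformed tuples instead of terminating
    }
}
```

With the default strict mode, the same malformed input would cause a runtime error and terminate the operator.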
Exceptions
The TCPSource operator will throw an exception and terminate the operator in the following cases:
• The host cannot be resolved.
• The name cannot be located.
• Unable to set SO_REUSEADDR on TCP socket.
• Unable to bind to port.
composite Main {
  graph
    // The schema <rstring name, uint32 age, uint64 salary> is assumed for
    // illustration in each invocation below.

    // server source with an alias string as port
    stream<rstring name, uint32 age, uint64 salary> Beat = TCPSource()
    {
      param
        role : server;
        port : "ftp";
    }

    // server source with a numeric port
    stream<rstring name, uint32 age, uint64 salary> Beat1 = TCPSource()
    {
      param
        role : server;
        port : 23145u;
    }

    // server source with a name, registering interface eth1
    stream<rstring name, uint32 age, uint64 salary> Beat2 = TCPSource()
    {
      param
        role      : server;
        name      : "my_server";
        interface : "eth1";
    }

    // server source with a name and port
    stream<rstring name, uint32 age, uint64 salary> Beat3 = TCPSource()
    {
      param
        role : server;
        port : 23145u;
        name : "my_server";
    }

    // server source with a port and infinite reconnection
    stream<rstring name, uint32 age, uint64 salary> Beat4 = TCPSource()
    {
      param
        role               : server;
        port               : "ftp";
        reconnectionPolicy : InfiniteRetry;
    }

    // server source with a port and reconnection (5 times)
    stream<rstring name, uint32 age, uint64 salary> Beat4r = TCPSource()
    {
      param
        role               : server;
        port               : "ftp";
        reconnectionPolicy : BoundedRetry;
        reconnectionBound  : 5u;
    }

    // client source with an IP address and port
    stream<rstring name, uint32 age, uint64 salary> Beat5 = TCPSource()
    {
      param
        role    : client;
        address : "99.2.45.67";
        port    : "ftp";
    }

    // client source with a host name as the address
    stream<rstring name, uint32 age, uint64 salary> Beat6 = TCPSource()
    {
      param
        role    : client;
        address : "mynode.mydomain";
        port    : 23145u;
    }

    // client source with name
    stream<rstring name, uint32 age, uint64 salary> Beat7 = TCPSource()
    {
      param
        role : client;
        name : "my_server";
    }

    // client source with reconnection
    stream<rstring name, uint32 age, uint64 salary> Beat8 = TCPSource()
    {
      param
        role               : client;
        address            : "mynode.mydomain";
        port               : "ftp";
        reconnectionPolicy : InfiniteRetry;
    }

    // client source with bounded reconnection (10 connections)
    // Wait 5 seconds before starting
    stream<rstring name, uint32 age, uint64 salary> Beat9 = TCPSource()
    {
      param
        role               : client;
        address            : "mynode.mydomain";
        port               : "ftp";
        reconnectionPolicy : BoundedRetry;
        reconnectionBound  : 10u;
        initDelay          : 5.0;
    }
}
TCPSink
Description
The TCPSink operator writes data to a TCP socket in the form of tuples. It can be configured as a TCP server (listens for a client connection) or as a TCP client (initiates a connection to a server). In both modes it handles a single connection at a time.
Input Ports
The TCPSink operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.
Output Ports
The TCPSink operator does not have any output ports.
Parameters
The TCPSink operator has the following parameters:
role
See the corresponding parameter in the TCPSource operator.
address
See the corresponding parameter in the TCPSource operator.
port
See the corresponding parameter in the TCPSource operator.
name
In the case of a server-based TCPSink operator, this parameter specifies the name to be used to register the address and port pair for the server with the name service that is part of the Streams runtime. This name can be used by a corresponding client-based TCPSource operator to connect to this operator by just specifying the name, without the need for an address or port number. These names are automatically prefixed by the application scope, thus applications with differing scopes cannot communicate through the same name. The application scope can be set through the use of config applicationScope on the main composite in the application. It is an error for a name with the same application scope to be defined multiple times within an instance. If multiple operators attempt to define the same name, the second and subsequent operators will keep trying periodically to register the name, with an error message for each failure. In the case of a client-based TCPSink, this parameter specifies the name to be used to look up the address and port pair for the destination server from the name service that is part of the Streams runtime. It is an optional parameter that takes a single value of type rstring. When the name parameter is specified in the client mode, the port and address parameters cannot be specified.
interface
This optional rstring parameter specifies the network interface to use to register when the name parameter is specified. interface is only valid when role is server and when name is specified. Using interface with name will ensure that a matching operator with a role of client and the same name parameter will use the desired interface.
sendBufferSize
This is an optional parameter that is used to override the default kernel send buffer size. It is of type uint32.
reconnectionPolicy
See the corresponding parameter in the TCPSource operator.
reconnectionBound
See the corresponding parameter in the TCPSource operator.
format
See the corresponding parameter in the FileSink operator.
hasDelayField
See the corresponding parameter in the FileSink operator.
compression
See the corresponding parameter in the FileSink operator.
encoding
See the corresponding parameter in the FileSink operator.
eolMarker