Tools for processing text: awk - Santa Monica...


Page 1: Tools for processing text: awk - Santa Monica Collegehomepage.smc.edu/.../linux/a05ss-04-InteractiveBash-text-tools_awk.pdf · Tools for processing text: awk David Morgan awk scripts

Tools for processing text: awk

David Morgan

awk scripts

• patterns - actions

Kinds of patterns

• /regular expression/
• relational expression
• pattern-matching expression
• BEGIN
• END
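
A minimal sketch of one pattern of each kind, run over made-up sample data (the fruit names and numbers are illustrative only). It calls awk; gawk runs it unchanged.

```shell
data='apple 5
banana 12
cherry 7'

out=$(printf '%s\n' "$data" | awk '
    BEGIN      { print "start" }               # before any input is read
    /an/       { print "regex:", $1 }          # /regular expression/
    $2 > 6     { print "relational:", $1 }     # relational expression
    $1 ~ /^c/  { print "match:", $1 }          # pattern-matching expression
    END        { print "end" }                 # after all input is read
')
printf '%s\n' "$out"
```

A record can satisfy more than one pattern; each matching pattern's action runs, in script order.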

Kinds of actions

• variable assignments
• input/output commands
• built-in functions
• control flow commands
• user-defined functions
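
A small sketch that touches each kind of action in one script. The function name hyp and the input numbers are made up for illustration.

```shell
out=$(printf '3 4\n' | awk '
    function hyp(a, b) { return sqrt(a*a + b*b) }   # user-defined function
    {
        h = hyp($1, $2)                  # variable assignment
        if (h > 4)                       # control flow command
            print "hyp", h, length($0)   # output command, built-in function
    }
')
printf '%s\n' "$out"
```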


Ways to run gawk

• gawk 'pattern' file             default action: print line
• gawk '{action}' file            default pattern: match every line
• gawk 'pattern {action}' file
• gawk -f script file             pattern{action}'s in script
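
The four invocation forms can be tried side by side. This is a sketch only: the data and script files are temporary files created just for the demonstration, and awk stands in for gawk.

```shell
data=$(mktemp)
script=$(mktemp)
printf 'ann 30\nbob 25\n' > "$data"

r1=$(awk '/ann/' "$data")                # pattern only: prints matching lines
r2=$(awk '{ print $1 }' "$data")         # action only: applied to every line
r3=$(awk '/bob/ { print $2 }' "$data")   # pattern plus action

printf '/bob/ { print $2 }\n' > "$script"
r4=$(awk -f "$script" "$data")           # same program, read from a script file

printf '%s\n' "$r1" "$r2" "$r3" "$r4"
rm -f "$data" "$script"
```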

Script operation

• process each line from an input source
• apply to it each pattern{action} line in the script
• if the line matches (per pattern), apply the action to the input line

AWK PROGRAM EXECUTION

read program source from the "script" file
execute any code in BEGIN block(s)
read input "file" (or if none, standard input)
test each input record against patterns in the AWK program in order of appearance
    if it matches any pattern, execute the associated action
execute any code in END block(s)

--gawk man page, heavily abridged and adapted
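
The execution order above can be observed by printing NR (records read so far) at each stage. A minimal sketch with two made-up input lines:

```shell
out=$(printf 'x\ny\n' | awk '
    BEGIN { print "begin: NR=" NR }         # runs before any input is read
          { print "record " NR ": " $0 }    # runs once per input record
    END   { print "end: NR=" NR }           # runs after the last record
')
printf '%s\n' "$out"
```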


Ways to run gawk

• data
• pattern, no action (default: print line)
• action, no pattern (default: select line)
• pattern plus action
• multiple pattern{action}'s on command line
• in script file (same result)

BEGIN to trigger action

Correct syntax, but ineffective! gawk is waiting for standard input, something to match; nothing is printed.

It only "acts" on a matched "pattern", so give it one (BEGIN) to spur it on.
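
A sketch of the difference. The second command is given empty input from /dev/null so that it returns instead of waiting at the terminal.

```shell
# BEGIN matches before any input is read, so the action runs immediately.
with_begin=$(awk 'BEGIN { print "hello" }')

# Without BEGIN the action needs an input record; with no records, nothing prints.
without_begin=$(awk '{ print "hello" }' < /dev/null)

printf 'with BEGIN: [%s]  without: [%s]\n' "$with_begin" "$without_begin"
```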


Some gawk numeric functions

Numeric Functions

atan2(y, x)   returns arctangent of y/x in radians
cos(expr)     returns cosine of expr (expr in radians)
exp(expr)     returns e raised to expr
int(expr)     truncates expr to integer
log(expr)     returns natural logarithm of expr
rand()        returns random number N, 0 ≤ N < 1
sin(expr)     returns sine of expr (expr in radians)
sqrt(expr)    returns square root of expr
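
A few of these functions in use, as a quick sketch. The argument values are arbitrary; atan2(0, -1) is a common idiom for obtaining pi.

```shell
out=$(awk 'BEGIN {
    printf "%d\n", int(-3.7)          # truncates toward zero
    printf "%d\n", sqrt(16)
    printf "%.4f\n", atan2(0, -1)     # pi, to four decimal places
    printf "%d\n", exp(0)
}')
printf '%s\n' "$out"
```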

Some gawk string functions

String Functions

asort(s [, d])           sorts the values of array s (into d, if given)
gsub(r, s [, t])         search and replace: substitutes s for every match of r in t ($0 if t is omitted)
index(s, t)              returns position of substring t within string s
length([s])              returns length of string s ($0 if s is omitted)
match(s, r [, a])        returns the position of r in s (0 if no match)
split(s, a [, r])        splits string s into array a on regular expression separator r
sprintf(fmt, expr-list)  returns expr-list formatted according to fmt
strtonum(str)            returns the numeric value of string str
substr(s, i [, n])       returns the at most n-character substring of s starting at i
tolower(str)             lower-cases string str
toupper(str)             upper-cases string str
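
A short sketch exercising some of the string functions on a made-up string. Only functions available in any POSIX awk are used here; asort and strtonum are gawk extensions and are left out.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    print length(s)
    print toupper(s)
    n = gsub(/o/, "0", s)              # replaces every o in s, returns the count
    print n, s
    print sprintf("%05.1f", 3.14159)   # sprintf returns the string; print emits it
}')
printf '%s\n' "$out"
```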


Some gawk bitwise functions

and(v1, v2)          return bitwise AND of v1 and v2
compl(val)           return the bitwise complement of val
lshift(val, count)   return val, shifted left by count bits
or(v1, v2)           return bitwise OR of v1 and v2
rshift(val, count)   return val, shifted right by count bits
xor(v1, v2)          return bitwise XOR of v1 and v2

system function

• gawk 'BEGIN{system("date")}'
• gawk 'BEGIN{"date" | getline d ; print d}'
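
The two forms differ in where the command's output goes: system() lets the command write directly to standard output, while command | getline captures a line of it in a variable. A sketch using echo in place of date, so that the result is predictable:

```shell
# system(): the child command writes to standard output itself.
out1=$(awk 'BEGIN { system("echo from-system") }')

# command | getline var: awk reads the command output into d.
out2=$(awk 'BEGIN {
    cmd = "echo from-pipe"
    cmd | getline d
    close(cmd)              # release the pipe once it is no longer needed
    print "got:", d
}')
printf '%s\n' "$out1" "$out2"
```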


Extracting a substring

• index( )
• substr( )
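
One way the two functions combine: index() locates a delimiter and substr() cuts around it. The address below is a made-up example.

```shell
out=$(awk 'BEGIN {
    s = "user@example.com"
    at = index(s, "@")            # position of the delimiter
    print substr(s, 1, at - 1)    # everything before it
    print substr(s, at + 1)       # everything after it
}')
printf '%s\n' "$out"
```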

Variables

• categories
    • user-defined
    • built-in
    • field
• data types
    • string or number
    • context-inferred


Field variable names

• $1, $2 … for first, second …
• $0 for whole line
• shell positional parameters use same names
    • for command line arguments
    • args, similarly, are command line "fields"
• but shell $1 and awk $1 are not the same
    • don't confuse them
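
A sketch that puts both kinds of $1 in one command. The function name show and the variable name from_shell are made up for the illustration. Inside single quotes the shell leaves $1 alone, so awk sees it as its own first field; the shell's $1 is passed in explicitly with -v.

```shell
show() {
    # Here "$1" is the shell function's first argument.
    printf 'alpha beta\n' | awk -v from_shell="$1" '{ print $1, from_shell }'
}
out=$(show "gamma")
printf '%s\n' "$out"
```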

Variable naming and typing

• naming
• type inference from context
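
A minimal sketch of context-inferred typing: the same variable acts as a number under an arithmetic operator and as a string under concatenation.

```shell
out=$(awk 'BEGIN {
    x = "3"
    print x + 4        # arithmetic context: x is treated as a number
    print x "4"        # concatenation context: x is treated as a string
    y = 10
    print length(y)    # a number used where a string is expected
}')
printf '%s\n' "$out"
```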


Main built-in variables

NF         The number of fields in the current input record
NR         The total number of input records seen so far
FS         The input field separator, a space by default
RS         The input record separator, by default a newline
OFS        The output field separator, a space by default
ORS        The output record separator, by default a newline
FILENAME   The name of the current input file
ARGC       The number of command line arguments
ARGV       Array of command line arguments
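
A quick sketch of NR, NF and OFS at work on two made-up input lines. $NF is the last field, because NF holds the field count.

```shell
out=$(printf 'a b c\nd e\n' | awk 'BEGIN { OFS = "-" } { print NR, NF, $NF }')
printf '%s\n' "$out"
```

The commas in the print statement are where OFS is inserted.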

Field separator specification

recognizes passwd file's colon as field separator
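
A sketch using one passwd-style line supplied inline, so it does not depend on the contents of a real /etc/passwd. The -F option sets FS from the command line.

```shell
line='root:x:0:0:root:/root:/bin/bash'
out=$(printf '%s\n' "$line" | awk -F: '{ print $1, $7 }')
printf '%s\n' "$out"
```

Setting FS = ":" in a BEGIN block has the same effect as -F: here.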


Functions, arrays, loops (bubble sort)

• Current row into (unsorted) array
• print (sorted) array
• send for sorting (by reference!)
• loop that spanned these lines, not in script's code, is gawk's internal main loop
• loop to increasing high-water marks, from 2
• Visit descendingly successive pairs, swapping if out-of-order, stopping upon an in-order pair

Quigley textbook, p. 249
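
The annotations above describe a script along the following lines. This is a reconstruction from the notes, not the textbook's code: the names sort_fields and arr are made up, and the comparison assumes numeric fields.

```shell
out=$(printf '10 9 2 33\n5 1\n' | awk '
    # Arrays are passed by reference, so the swaps here change the caller array.
    function sort_fields(a, n,    i, j, t) {
        for (i = 2; i <= n; i++)                              # rising high-water mark, from 2
            for (j = i; j > 1 && a[j-1] + 0 > a[j] + 0; j--) {  # stop at an in-order pair
                t = a[j-1]; a[j-1] = a[j]; a[j] = t           # swap an out-of-order pair
            }
    }
    {
        for (k = 1; k <= NF; k++) arr[k] = $k                 # current row into array
        sort_fields(arr, NF)                                  # send for sorting
        for (k = 1; k <= NF; k++)                             # print sorted array
            printf "%s%s", arr[k], (k < NF ? " " : "\n")
    }
')
printf '%s\n' "$out"
```

The outer per-row loop does not appear in the script; awk supplies it by running the action once for each input record.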

Save/restore text files

• Save files' lines, mark each with its filename
• we have 3 files
• bye, files
• bring them back
• they're back

* Consumes a file descriptor. In production use, first line: … { close(prev); prev = $1 } so as not to exhaust available descriptors.

+ The redirection operators > and >> are used to put output into files instead of the standard output. ... It is also important to note that a redirection operator opens a file only once; each successive print or printf statement adds more data to the open file. When the redirection operator > is used, the file is initially cleared before any output is written to it. If >> is used instead of >, the file is not initially cleared; output is appended after the original contents.
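
A sketch of the save/restore idea under stated assumptions: the file names a.txt and b.txt, the archive name saved, and the temporary directory are all made up; file names are assumed to contain no spaces; and the test that decides when to call close() is this sketch's own guess, since the pattern on the slide is elided.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\n' > a.txt
printf 'three\n' > b.txt

# Save: prefix every line with the name of the file it came from.
awk '{ print FILENAME, $0 }' a.txt b.txt > saved
rm a.txt b.txt

# Restore: the first field names the target file; the rest is the original line.
awk '{
    file = $1
    if (file != prev) { close(prev); prev = file }   # keep one output file open at a time
    sub(/^[^ ]+ /, "")                                # strip the filename prefix from $0
    print > file                                      # opened once, then appended to
}' saved

restored=$(cat a.txt b.txt)
printf '%s\n' "$restored"
```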


RS - record separator

• Print records containing “New York”
• one record containing “New York”
• another record containing “New York”
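
A sketch of multi-line records, assuming blank-line-separated input (the names and cities are made up). Setting RS to the empty string makes each blank-line-delimited block one record, and newlines then also separate fields.

```shell
data='Alice
New York

Bob
Boston

Carol
New York'

out=$(printf '%s\n' "$data" | awk 'BEGIN { RS = "" } /New York/ { print $1 }')
printf '%s\n' "$out"
```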

split - distribute fields-in-record into elements-in-array
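
A minimal sketch: with no separator argument, split() uses FS, so the array elements mirror the record's fields. The array name word is arbitrary.

```shell
out=$(printf 'a b c\n' | awk '{ n = split($0, word); print n, word[3], word[1] }')
printf '%s\n' "$out"
```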

for - two kinds of for loop

• C-style for
• awk-style for, for array elements

“The order in which the subscripts are considered is implementation dependent.” The AWK Programming Language p.51
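
Both loop forms over one made-up array. Because the visiting order of for (k in array) is implementation dependent, the sketch only counts elements in that loop rather than printing them.

```shell
out=$(awk 'BEGIN {
    n = split("red green blue", color, " ")

    # C-style for: explicit index, predictable order
    for (i = 1; i <= n; i++)
        s = s (i > 1 ? " " : "") i ":" color[i]
    print s

    # awk-style for: visits every subscript, in no guaranteed order
    count = 0
    for (k in color) count++
    print count
}')
printf '%s\n' "$out"
```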


? : conditional operator

expr1 ? expr2 : expr3

evaluates to expr2 or expr3, according as expr1 is true or false, respectively
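
A one-line sketch on two made-up inputs. The parentheses keep the conditional expression together as a single print argument.

```shell
out=$(printf '4\n7\n' | awk '{ print $1, ($1 % 2 == 0 ? "even" : "odd") }')
printf '%s\n' "$out"
```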

awk approach to word frequency

• A loop covers all fields (words) in a line, while gawk does so for all lines
• An array element “counting bucket” for each word that occurs. The word is the associative array subscript for the array element that counts that word's occurrences
• most commonly occurring words in kjv.txt (King James Version of the Bible)
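
A sketch of the counting technique on three made-up lines instead of kjv.txt. The array name count is arbitrary, and sort and head are used afterwards to rank the words, since awk's own array traversal order is not defined.

```shell
text='the quick fox
the lazy dog
the fox'

out=$(printf '%s\n' "$text" | awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }     # one counting bucket per word
    END { for (w in count) print count[w], w }    # emit count and word
' | sort -k1,1nr -k2,2 | head -n 2)
printf '%s\n' "$out"
```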


getline function