R&D Group (Development, People First, Communication, Creating Value). Liqi Gao: Text Operations.

Page 1

Liqi Gao

Text Operations

Page 2

Utilities

• C/C++ library
• Perl (ActivePerl)
• Regular expressions
• EditPlus / UltraEdit
• Excel

Page 3

C/C++ Language

Standard library:
• Read a line
• Remove a CR or LF
• Split a line

C++ Boost library:
• Case conversion
• Trimming
• Replace algorithm
• Finding algorithm
• Split

Page 4

C/C++: Read a Line

Though it’s simple, it’s useful! Three methods:

Page 5

C/C++: Remove CR/LF

Reading a line on Windows and Linux platforms

Page 6

C/C++: Remove CR/LF (cont.)

The noisy CR (carriage return)
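Windows text files end each line with CR LF, while Linux uses LF alone, so a line read on Linux from a Windows file can keep a stray CR at the end. A minimal sketch of stripping both characters follows; the function name chomp is hypothetical (borrowed from Perl) and is not taken from the slides.

```cpp
#include <string>

// Remove every trailing '\r' or '\n' from line, in place.
// Handles "text\n" (Linux), "text\r\n" (Windows) and "text\r"
// (what remains after getline strips only the '\n').
std::string& chomp(std::string& line) {
    while (!line.empty() && (line.back() == '\n' || line.back() == '\r')) {
        line.pop_back();
    }
    return line;
}
```

Calling chomp(line) right after each read keeps later comparisons and splits from failing on an invisible trailing character.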

Page 7

C/C++: Split a Line

Split a line by a specific character

[Figure: the characters of "HELLO WORLD!" laid out one per cell, shown twice]
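Splitting by a single character can be sketched with std::string::find. This is an illustrative version, not the code from the slides, and the name split_line is hypothetical.

```cpp
#include <string>
#include <vector>

// Split line at every occurrence of delim. Empty fields are kept,
// so "a,,b" yields three fields and "" yields one empty field.
std::vector<std::string> split_line(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // continue after the delimiter
    }
    return fields;
}
```

For example, split_line("HELLO WORLD!", ' ') returns the two fields "HELLO" and "WORLD!". Keeping empty fields preserves column positions in tab- or comma-separated data.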

Page 8

C/C++: Split a Line (cont.)

Split a line

Page 9

C++ Boost: Case Conversion

to_upper: Convert a string to upper case
to_lower: Convert a string to lower case
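Boost's to_upper and to_lower (header boost/algorithm/string.hpp) modify the string in place. To show the idea without depending on Boost, here is a standard-library analogue. It is not Boost's implementation, the function names are hypothetical, and it uses the default C locale rules.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// In-place upper-casing, analogous in effect to boost::to_upper(s).
void ascii_to_upper(std::string& s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
}

// In-place lower-casing, analogous in effect to boost::to_lower(s).
void ascii_to_lower(std::string& s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
}
```

The lambda takes unsigned char because passing a negative char value to std::toupper or std::tolower is undefined behaviour.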

Page 10

C++ Boost: Trimming & Replace
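The body of this slide is not preserved in the transcript. Boost provides trim and replace_all in boost/algorithm/string.hpp; the sketch below is a standard-library analogue of those two operations, offered as an assumption about what the slide covered. The names trimmed and replace_every are hypothetical.

```cpp
#include <string>

// Return a copy of s without leading or trailing whitespace.
std::string trimmed(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return "";  // s is empty or all whitespace
    }
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Replace every occurrence of from with to, in place.
std::string& replace_every(std::string& s, const std::string& from,
                           const std::string& to) {
    if (from.empty()) {
        return s;  // nothing to search for
    }
    std::string::size_type pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();  // skip the inserted text to avoid rescanning it
    }
    return s;
}
```

Advancing past the replacement text means that replacing "a" with "aa" terminates instead of looping forever.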

Page 11

C++ Boost: Split

split(): splits the input into parts
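Boost's split is typically called with a predicate such as is_any_of(",;") so that any character from a set acts as a separator. A standard-library sketch of that behaviour, under the assumption that empty parts are kept, might look like this (split_any is a hypothetical name, not a Boost function):

```cpp
#include <string>
#include <vector>

// Split input at any character contained in delims.
// Adjacent delimiters produce empty parts.
std::vector<std::string> split_any(const std::string& input,
                                   const std::string& delims) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = input.find_first_of(delims, start);
        if (pos == std::string::npos) {
            parts.push_back(input.substr(start));  // remainder of the input
            break;
        }
        parts.push_back(input.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

With this sketch, split_any("a,b;c", ",;") yields "a", "b" and "c".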

Page 12

Regular Expression

Regular expressions are a powerful tool for string operations.

Operator   Meaning                                Example   Matches
*          0 or more times                        be*       b, be, bee, beee, …
?          0 or 1 time                            be?       b, be
+          1 or more times                        be+       be, bee, beee, …
[]         any one of the enclosed characters     [A-Z]
[^]        any character not enclosed             [^a-z]
()         group                                  (abc)+
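The operators in the table can be tried out with std::regex, whose default ECMAScript syntax supports all of them. The helper below is a small sketch; full_match is a hypothetical name.

```cpp
#include <regex>
#include <string>

// True if the whole of text matches pattern (ECMAScript syntax).
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

For instance, full_match("beee", "be*") and full_match("abcabc", "(abc)+") are both true, while full_match("b", "be+") is false because + requires at least one e.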

Page 13

An Example

Find (the pattern begins with a space):  *\([0-9/ ]+\) *[0-9\.\?]+%
Replace with: (empty)

Find: ^( *)([0-9]+)( *)
Replace with: \2\t
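Reading the example as two find/replace pairs, which is an interpretation of the flattened slide, the same edits can be sketched with std::regex_replace. In ECMAScript syntax the second capture group is written $2 rather than \2, and the dot and question mark need no escaping inside brackets. The function name and the sample input are hypothetical.

```cpp
#include <regex>
#include <string>

// Apply both replacements from the example to one line:
//   1. delete a parenthesised count followed by a percentage,
//      e.g. " (3/ 4) 75.0%"
//   2. replace a leading, space-padded number with the number and a tab
std::string clean_line(const std::string& line) {
    static const std::regex stats(R"( *\([0-9/ ]+\) *[0-9.?]+%)");
    static const std::regex leading_number(R"(^( *)([0-9]+)( *))");
    const std::string without_stats = std::regex_replace(line, stats, "");
    return std::regex_replace(without_stats, leading_number, "$2\t");
}
```

Given the made-up input "  12  foo (3/ 4) 75.0%", the first step leaves "  12  foo" and the second produces "12", a tab, then "foo".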

Page 14

An Introduction to Perl

Excels at pattern search and text manipulation (Practical Extraction and Report Language)

Open source / free software: cheap!
• Free and available for all systems
• Can be used and installed without restriction
• Open source promotes portability
• Vastly expandable through freely available modules (add-on libraries in the CPAN repository)
• Fewer restrictions and lower cost for commercial use
• Fancy development tools can be bought if desired
• Centralized source and a linear development path avoid vendor vicissitudes and incompatibilities!

Page 15

Perl is not compiled

C (compiled)

Source code
• Plain text (ASCII)
• Human readable
• Human editable
• Platform independent

    #include <stdio.h>

    int main(void) {
        double x;
        x = 6e9;
        printf("Hello world!\n");
        /* %.0f rather than %d: x is floating point and 6e9 does not fit in a 32-bit int */
        printf("All %.0f of you!\n", x);
        return 0;
    }

→ C compiler →

Binary executable
• NOT human readable
• NOT human editable
• NOT platform independent!

    10001110110011000111011100001110111000110111011000111000110111010100110111001011001101101101010101000111001110001101010101101010101001011101011101100011111000 ...

Perl (interpreted)

    #!/usr/bin/perl

    $x = 6e9;
    print "Hello world!\n";
    printf "All %d of you!\n", $x;

→ Perl interpreter →

    Hello world!
    All 6000000000 of you!

Page 16

A Taste of Perl: print a message

perltaste.pl: Greet the entire world.

    #!/usr/bin/perl -w               # command interpretation header
    $x = 6e9;                        # variable assignment statement
    print "Hello world!\n";          # function call (output statement)
    printf "All %d of you!\n", $x;   # function call (output statement)

Page 17

Scalar Values

Numerical values
• Integer: 5, "3", 0, -307
• Floating point: 6.2e9, -4022.33
• Hexadecimal/octal: 0x0d4f, 0477

NOTE: Perl stores numbers internally as native integers or floating-point numbers (usually "double" precision) and converts between them automatically.

Page 18

String Values

Double-quoted: interpolates (replaces a variable name or control character with its value)

Single-quoted: no interpolation done (as-is)

Quoting operators: qq//, qw//, etc.

    $day = "Monday";

    Expression             Output
    "Happy Monday!\n"      Happy Monday!<NL>
    "Happy $day!\n"        Happy Monday!<NL>
    'Happy Monday!\n'      Happy Monday!\n
    'Happy $day!\n'        Happy $day!\n

Page 19

String Manipulation: Concatenation

    $dna1 = "ACTGCGTAGC";
    $dna2 = "CTTGCTAT";

Juxtapose in a string assignment or print statement:

    $new_dna = "$dna1$dna2";

Use the concatenation operator '.':

    $new_dna = $dna1 . $dna2;

Add segments serially using incremental concatenation:

    $new_dna = $dna1;
    $new_dna .= $dna2;

(shorthand for: $new_dna = $new_dna . $dna2;)

Page 20

Substitution

DNA transcription: T → U

Substitution operator s///:

    $dna = "GATTACATACACTGTTCA";
    $rna = $dna;
    $rna =~ s/T/U/g;   # "GAUUACAUACACUGUUCA" (/g replaces every T, not just the first)

Exercise: Start with

    $dna = "gattACataCACTgttca";

and do the same as above. Print out $rna to the screen.

Page 21

transcribe.pl:

    $dna = "gattACataCACTgttca";
    $rna = $dna;
    $rna =~ s/T/U/g;
    print "DNA: $dna\n";
    print "RNA: $rna\n";

Does it do what you expect? If not, why not?

Patterns in substitution are case-sensitive! What can we do?
• Convert all letters to upper (or lower) case (preferred when possible)
• If we want to retain mixed case, use the transliteration operator tr///

    $rna =~ tr/tT/uU/;

Page 22

Case conversion

    $string = "acCGtGcaTGc";

Upper case:

    $dna = uc($string);            # "ACCGTGCATGC"
    or $dna = uc $string;
    or $dna = "\U$string";

Lower case:

    $dna = lc($string);            # "accgtgcatgc"
    or $dna = "\L$string";

Sentence case:

    $dna = ucfirst(lc($string));   # "Accgtgcatgc"
    or $dna = "\u\L$string";

Page 23

Perl in NLP

• Dictionary lookup
• Word frequency
• Chinese word segmentation
• POS
• ……
• Whatever else you might need

Page 24

Case study

Page 25

Thanks for your attention