Chapter 23: Text Processing (Bjarne Stroustrup)
Chapter 23
Text Processing
Bjarne Stroustrup
www.stroustrup.com/Programming

Overview
• Application domains
• Strings
• I/O
• Maps
• Regular expressions
Stroustrup/PPP - Oct'11

Now you know the basics
• Really! Congratulations!
• Don’t get stuck with a sterile focus on programming language features
• What matters are programs, applications, what good you can do with programming
  • Text processing
  • Numeric processing
  • Embedded systems programming
  • Banking
  • Medical applications
  • Scientific visualization
  • Animation
  • Route planning
  • Physical design

Text processing
• “all we know can be represented as text”
  • And often is
• Books, articles
• Transaction logs (email, phone, bank, sales, …)
• Web pages (even the layout instructions)
• Tables of figures (numbers)
• Mail
• Programs
• Measurements
• Historical data
• Medical records
• …

Sample text shown on the slide:
    Amendment I
    Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.

String overview
• Strings
  • std::string
    • <string>
    • s.size()
    • s1==s2
  • C-style string (zero-terminated array of char)
    • <cstring> or <string.h>
    • strlen(s)
    • strcmp(s1,s2)==0
  • std::basic_string<Ch>, e.g. Unicode strings
    • typedef std::basic_string<char> string;
  • Proprietary string classes

String conversion
• Simple to_string

    template<class T> string to_string(const T& t)
    {
        ostringstream os;
        os << t;
        return os.str();
    }

• For example:

    string s1 = to_string(12.333);
    string s2 = to_string(1+5*6-99/7);

String conversion
• Simple extract from string

    template<class T> T from_string(const string& s)
    {
        istringstream is(s);
        T t;
        if (!(is >> t)) throw bad_from_string();
        return t;
    }

• For example:

    double d = from_string<double>("12.333");
    Matrix<int,2> m = from_string< Matrix<int,2> >("{ {1,2}, {3,4} }");

General stream conversion

    template<typename Target, typename Source>
    Target lexical_cast(Source arg)
    {
        std::stringstream ss;
        Target result;
        if (!(ss << arg)                  // write arg into stream
            || !(ss >> result)            // read result from stream
            || !(ss >> std::ws).eof())    // stuff left in stream?
            throw bad_lexical_cast();
        return result;
    }

    string s = lexical_cast<string>(lexical_cast<double>(" 12.7 "));   // ok

    // works for any type that can be streamed into and/or out of a string:
    XX xx = lexical_cast<XX>(lexical_cast<YY>(XX(whatever)));   // !!!

I/O overview
[Diagram: stream class hierarchy. ifstream and istringstream derive from istream; ofstream and ostringstream derive from ostream; iostream derives from both istream and ostream; fstream and stringstream derive from iostream.]

Stream I/O
    in >> x          Read from in into x according to x’s format
    out << x         Write x to out according to x’s format
    in.get(c)        Read a character from in into c
    getline(in,s)    Read a line from in into the string s

Map overview
• Associative containers
  • <map>, <set>, <unordered_map>, <unordered_set>
  • map, multimap, set, multiset
  • unordered_map, unordered_multimap, unordered_set, unordered_multiset
• The backbone of text manipulation
  • Find a word
  • See if you have already seen a word
  • Find information that corresponds to a word
• See example in Chapter 23

Map overview
[Diagram: a Mail_file holds a vector<Message>; a multimap<string,Message*> with the keys “John Doe”, “John Doe”, and “John Q. Public” points at Message objects in that vector.]

A problem: Read a ZIP code
• U.S. state abbreviation and ZIP code
  • two letters followed by five digits

    string s;
    while (cin>>s) {
        if (s.size()==7
            && isalpha(s[0]) && isalpha(s[1])
            && isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4])
            && isdigit(s[5]) && isdigit(s[6]))
            cout << "found " << s << '\n';
    }

• Brittle, messy, unique code

A problem: Read a ZIP code
• Problems with simple solution
  • It’s verbose (4 lines, 8 function calls)
  • We miss (intentionally?) every ZIP code number not separated from its context by whitespace
    • "TX77845", TX77845-1234, and ATM77845
  • We miss (intentionally?) every ZIP code number with a space between the letters and the digits
    • TX 77845
  • We accept (intentionally?) every ZIP code number with the letters in lower case
    • tx77845
  • If we decided to look for a postal code in a different format we have to completely rewrite the code
    • CB3 0DS, DK-8000 Arhus

TX77845-1234
• 1st try: wwddddd
• 2nd (remember -1234): wwddddd-dddd
• What’s “special”?
• 3rd: \w\w\d\d\d\d\d-\d\d\d\d
• 4th (make counts explicit): \w2\d5-\d4
• 5th (and “special”): \w{2}\d{5}-\d{4}
• But -1234 was optional?
• 6th: \w{2}\d{5}(-\d{4})?
• We wanted an optional space after TX
• 7th (invisible space): \w{2} ?\d{5}(-\d{4})?
• 8th (make space visible): \w{2}\s?\d{5}(-\d{4})?
• 9th (lots of space – or none): \w{2}\s*\d{5}(-\d{4})?

Regex library – availability
• Not part of C++98 standard
• Part of “Technical Report 1” 2004
• Part of C++0x (published as C++11)
• Ships with
  • VS 9.0 C++, use <regex>, std::tr1::regex
  • GCC 4.3.0, use <tr1/regex>, std::tr1::regex
  • www.boost.org, use <boost/regex.hpp>, boost::regex

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>
    #include <fstream>
    using namespace std;

    int main()
    {
        ifstream in("file.txt");    // input file
        if (!in) cerr << "no file\n";

        boost::regex pat ("\\w{2}\\s*\\d{5}(-\\d{4})?");    // ZIP code pattern
        cout << "pattern: " << pat << '\n';

        // …
    }

    int lineno = 0;
    string line;    // input buffer
    while (getline(in,line)) {
        ++lineno;
        boost::smatch matches;    // matched strings go here
        if (regex_search(line, matches, pat)) {
            cout << lineno << ": " << matches[0] << '\n';    // whole match
            if (1<matches.size() && matches[1].matched)
                cout << "\t: " << matches[1] << '\n';    // sub-match
        }
    }

Results
Input:
    address TX77845
    ffff tx 77843 asasasaa
    ggg TX3456-23456
    howdy
    zzz TX23456-3456sss ggg TX33456-1234
    cvzcv TX77845-1234 sdsas
    xxxTx77845xxx
    TX12345-123456

Output:
    pattern: "\w{2}\s*\d{5}(-\d{4})?"
    1: TX77845
    2: tx 77843
    5: TX23456-3456
        : -3456
    6: TX77845-1234
        : -1234
    7: Tx77845
    8: TX12345-1234
        : -1234

Regular expression syntax
• Regular expressions have a thorough theoretical foundation based on state machines
  • You can mess with the syntax, but not much with the semantics
• The syntax is terse, cryptic, boring, useful
  • Go learn it
• Examples
    Xa{2,3}                     // Xaa Xaaa
    Xb{2}                       // Xbb
    Xc{2,}                      // Xcc Xccc Xcccc Xccccc …
    \w{2}-\d{4,5}               // \w is a word character (letter, digit, or underscore), \d is a digit
    (\d*:)?(\d+)                // 124:1232321 :123 123
    Subject: (FW:|Re:)?(.*)     // . (dot) matches any character
    [a-zA-Z][a-zA-Z_0-9]*       // identifier
    [^aeiouy]                   // not an English vowel

Searching vs. matching
• Searching for a string that matches a regular expression in an (arbitrarily long) stream of data
  • regex_search() looks for its pattern as a substring in the stream
• Matching a regular expression against a string (of known size)
  • regex_match() looks for a complete match of its pattern and the string

Table grabbed from the web
    KLASSE          ANTAL DRENGE    ANTAL PIGER    ELEVER IALT
    0A              12              11             23
    1A              7               8              15
    1B              4               11             15
    2A              10              13             23
    3A              10              12             22
    4A              7               7              14
    4B              10              5              15
    5A              19              8              27
    6A              10              9              19
    6B              9               10             19
    7A              7               19             26
    7G              3               5              8
    7I              7               3              10
    8A              10              16             26
    9A              12              15             27
    0MO             3               2              5
    0P1             1               1              2
    0P2             0               5              5
    10B             4               4              8
    10CE            0               1              1
    1MO             8               5              13
    2CE             8               5              13
    3DCE            3               3              6
    4MO             4               1              5
    6CE             3               4              7
    8CE             4               4              8
    9CE             4               9              13
    REST            5               6              11
    Alle klasser    184             202            386

(The table is Danish: KLASSE = class, ANTAL DRENGE = number of boys, ANTAL PIGER = number of girls, ELEVER IALT = pupils in total, Alle klasser = all classes.)

• Numeric fields
• Text fields
• Invisible field separators
• Semantic dependencies
  • i.e. the numbers actually mean something
  • first numeric column + second numeric column == third numeric column (boys + girls == total)
  • The last line holds the column sums

Describe rows
• Header line
  • Regular expression: ^[\w ]+(	[\w ]+)*$
  • As string literal: "^[\\w ]+(	[\\w ]+)*$"
• Other lines
  • Regular expression: ^([\w ]+)(	\d+)(	\d+)(	\d+)$
  • As string literal: "^([\\w ]+)(	\\d+)(	\\d+)(	\\d+)$"
  (In all four patterns, the whitespace right after each opening parenthesis of a repeated or numeric group is a single tab character.)
• Aren’t those invisible tab characters annoying?
  • Define a tab character class
• Aren’t those invisible space characters annoying?
  • Use \s

Simple layout check

    int main()
    {
        ifstream in("table.txt");    // input file
        if (!in) error("no input file\n");

        string line;    // input buffer
        int lineno = 0;

        regex header( "^[\\w ]+(\t[\\w ]+)*$");    // header line
        regex row( "^([\\w ]+)(\t\\d+)(\t\\d+)(\t\\d+)$");    // data line

        // … check layout …
    }

Simple layout check

    int main()
    {
        // … open files, define patterns …

        if (getline(in,line)) {    // check header line
            smatch matches;
            if (!regex_match(line, matches, header)) error("no header");
        }

        while (getline(in,line)) {    // check data line
            ++lineno;
            smatch matches;
            if (!regex_match(line, matches, row))
                error("bad line", to_string(lineno));
        }
    }

Validate table

    int boys = 0;    // column totals
    int girls = 0;

    while (getline(in,line)) {    // extract and check data
        smatch matches;
        if (!regex_match(line, matches, row)) error("bad line");

        int curr_boy = from_string<int>(matches[2]);    // check row
        int curr_girl = from_string<int>(matches[3]);
        int curr_total = from_string<int>(matches[4]);
        if (curr_boy+curr_girl != curr_total) error("bad row sum");

        if (matches[1]=="Alle klasser") {    // last line; check columns:
            if (curr_boy != boys) error("boys don't add up");
            if (curr_girl != girls) error("girls don't add up");
            return 0;
        }

        boys += curr_boy;
        girls += curr_girl;
    }

Application domains
• Text processing is just one domain among many
  • Or even several domains (depending how you count)
  • Browsers, Word, Acrobat, Visual Studio, …
• Image processing
• Sound processing
• Data bases
  • Medical
  • Scientific
  • Commercial
  • …
• Numerics
• Financial
• …