Text Editing

Text Editing Kim Shepherd [email protected] Digital Development Team The University of Auckland Library Tools, tips, tricks LIANZA ITSIG webinar series



Page 1: Text Editing

Text Editing

Kim Shepherd [email protected]

Digital Development Team
The University of Auckland Library

Tools, tips, tricks

LIANZA ITSIG webinar series

Page 2: Text Editing

Summary

• General (large) text files
  – We manage and manipulate text data daily
  – It’s tedious and time-consuming
  – Find & Replace is too limited and dangerous
  – We know there must be a better way...

• Tabular data files (e.g. spreadsheets)
  – We work with these all the time, usually in Excel
  – What tools can help us clean messy data?

Page 3: Text Editing

Topics

• Regular Expressions

• Text Editors

• Operating on lines, not entire files

• Google Refine

Page 4: Text Editing

Regular Expressions

  /^\s+[a-zA-Z0-9](?:\W+)/

Page 5: Text Editing

Regular Expressions

• A way to describe a set of strings and capture parts of them

• Originated in old UNIX/POSIX tools

• Now used all over the place

• Test your regexes out on the web:
  – http://gskinner.com/RegExr/
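
To make "describe a set of strings and capture parts of them" concrete, here is a minimal Python sketch. It reuses the pattern from the title slide; the date pattern, the sample strings and the function name are assumptions made up for illustration, not part of the original talk.

```python
import re

# The pattern from the title slide: leading whitespace, one alphanumeric
# character, then one or more non-word characters (non-capturing group).
SLIDE_PATTERN = re.compile(r"^\s+[a-zA-Z0-9](?:\W+)")

# Hypothetical pattern for YYYY-MM-DD dates, with three capture groups.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def reorder_dates(text):
    """Rewrite every YYYY-MM-DD date in text as DD/MM/YYYY."""
    # \1, \2 and \3 refer back to the captured year, month and day.
    return DATE_PATTERN.sub(r"\3/\2/\1", text)


if __name__ == "__main__":
    print(bool(SLIDE_PATTERN.match("   a: indented")))
    print(bool(SLIDE_PATTERN.match("not indented")))
    print(DATE_PATTERN.search("Issued 2012-03-15").groups())
    print(reorder_dates("Issued 2012-03-15, revised 2012-04-02"))
```

Because the replacement only touches text that fits the pattern, this kind of substitution is more targeted than a plain Find & Replace.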

Page 6: Text Editing

Text Editors & Useful Languages

sed, grep, awk

Page 7: Text Editing

Text Editors

• Word processors aren’t text editors

• Shop around, compare features

• My favourite: Vim (UNIX, Windows, Mac)

  – Wikipedia comparison of editor features
  – Wikipedia list of regex software

Page 8: Text Editing

Useful Languages / Interpreters

• Perl
  – An old favourite, great for string manipulation

• Python
  – The cool kids tell me it’s better than Perl

• GREL
  – We’ll get to this later...

Page 9: Text Editing

Line-by-line processing

while (<STDIN>) {
    ...
}

Page 10: Text Editing

Line-by-line processing

• Large files are large!
  – If they’re big on disk, they’ll be big in memory

• Lines are (usually!) small
  – Read a line
  – Do something with it
  – Output the modified line
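
The read / modify / output loop above can be sketched in Python as follows. The clean-up rule (collapsing runs of spaces and tabs) and the function names are assumptions chosen for illustration; any per-line transformation would fit the same shape.

```python
import re
import sys


def clean_line(line):
    """Strip trailing whitespace and collapse runs of spaces and tabs."""
    return re.sub(r"[ \t]+", " ", line.rstrip())


def clean_lines(lines):
    """Yield cleaned lines one at a time, never holding the whole input."""
    for line in lines:
        yield clean_line(line)


if __name__ == "__main__":
    # sys.stdin is iterated lazily, so only one line is in memory at once.
    for cleaned in clean_lines(sys.stdin):
        print(cleaned)
```

Saved under a hypothetical name such as clean.py, it could be run as `python clean.py < big.txt > cleaned.txt`, and memory use stays roughly constant however large the input file is.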

Page 11: Text Editing
Page 12: Text Editing

Google Refine

• Cleans messy tabular data
  – Easy faceting and filtering of columns/values
  – Easy transformation of values

• Google Refine Expression Language (GREL)
  – Extensive use of regular expressions and other standard string manipulation techniques

• Other features
  – Perform web service calls directly, reconcile row IDs
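
For readers who have not seen faceting, the idea can be approximated outside Refine. The sketch below is a Python analogue, not GREL and not Refine's own implementation; the sample column and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical messy column of publisher names.
PUBLISHERS = ["Penguin", " penguin ", "PENGUIN", "Oxford UP", "Oxford  UP"]


def normalise(value):
    """Trim, collapse internal whitespace and title-case a cell value."""
    return " ".join(value.split()).title()


def facet(values):
    """Count how often each distinct value occurs, like a text facet."""
    return Counter(values)


if __name__ == "__main__":
    print(facet(PUBLISHERS))                        # five distinct spellings
    print(facet(normalise(v) for v in PUBLISHERS))  # two after clean-up
```

Faceting before and after a transformation is a quick way to see whether the clean-up actually merged the variants it was meant to.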

Page 13: Text Editing
Page 14: Text Editing

Conclusion

• Our problems are solvable!
  – Regular expressions
  – Decent text editors for general/unformatted text
  – Google Refine for tabular data

• Contact me
  – Please feel free to contact me with questions, corrections or ideas
  – [email protected]
  – Twitter: @kimshepherd
  – Google+: [email protected]