Feb.2016 Demystifying Digital Humanities - Workshop 2
Data Wrangling I: Exploring Programming in Digital Scholarship
February 19, 2016
Paige Morgan, Digital Humanities Librarian
Programming is complex enough that just figuring out what you want to do
and what sort of language you need is work.
Thinking that you ought to be able to do everything almost
immediately is a recipe for feeling terrible.
Working with technology means periodically starting
from scratch -- a bit like working with a new time
period or culture; or figuring out how to teach a new
class.
Being able to effectively communicate about your
project as it relates to programming is a skill in
itself.
Programming languages can...
• search for things
• match things
• read things
• write things
• receive information, and give it back, changed or unchanged
• count things
• do math
• arrange things in quantitative or random order
• respond: if x, do y OR do x until y happens
• compare things for similarity
• go to a file at a location, and retrieve readable text
• display things according to instructions that you provide
• draw points, lines, and shapes
Example #1
• find all the statements in quotes ("") from a novel
• count how many words are in each statement
• put the statements in order from smallest amount of words to largest
• write all the statements from the novel in a text file
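A minimal Python sketch of the four steps in Example #1. It assumes the novel uses straight double quotes and that a "word" is anything separated by whitespace; the sample sentence and the output filename are made up for illustration.

```python
import os
import re
import tempfile


def quoted_statements(text):
    """Find every stretch of text enclosed in straight double quotes."""
    return re.findall(r'"([^"]+)"', text)


def sort_by_word_count(statements):
    """Order statements from fewest words to most (whitespace-separated)."""
    return sorted(statements, key=lambda s: len(s.split()))


def write_statements(statements, path):
    """Write one statement per line, preceded by its word count."""
    with open(path, "w", encoding="utf-8") as handle:
        for statement in statements:
            handle.write(f"{len(statement.split())}\t{statement}\n")


# Invented sample text standing in for a whole novel.
sample = 'She said, "I will not go to the ball." He replied, "Why?" "Because it is raining," she said.'

ordered = sort_by_word_count(quoted_statements(sample))
out_path = os.path.join(tempfile.gettempdir(), "statements.txt")
write_statements(ordered, out_path)
print(ordered)
```

Note what the sketch cannot do: it has no idea who is speaking, and it would miss dialogue set in curly or single quotes.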
Example #2
• allow a user to type in some information, e.g., "Benedict Cumberbatch"
• compare “Benedict Cumberbatch” to a much larger file
• retrieve any data that matches the information
• print the retrieved information on screen
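Example #2 could be sketched in Python as follows. The three records are invented stand-ins for the "much larger file", and the decision to return the whole matching record (rather than just the name) is an assumption. In an interactive script the query would come from input() instead of being written into the code.

```python
def find_matches(query, records):
    """Return every record that contains the query, ignoring case."""
    needle = query.casefold()
    return [record for record in records if needle in record.casefold()]


# Invented sample records; a real project would read these from a file.
records = [
    "Benedict Cumberbatch | actor | Sherlock",
    "Martin Freeman | actor | Sherlock",
    "benedict cumberbatch | actor | Hamlet",
]

matches = find_matches("Benedict Cumberbatch", records)
for match in matches:
    print(match)
```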
Example #3
• "read" two texts -- say, two plays by Seneca
• search for any words that the two plays have in common
• print the words that they have in common on screen
• calculate what percentage of the words in each play are shared
• print that percentage onscreen
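A rough Python sketch of Example #3. The two short strings are invented placeholders, not text from Seneca. The sketch also assumes that "words in common" means distinct lowercase words, so the percentage is a share of each text's vocabulary rather than of its running word count.

```python
import re


def vocabulary(text):
    """Return the set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def shared_vocabulary(text_a, text_b):
    """Return the shared words and the share of each vocabulary they make up."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    shared = vocab_a & vocab_b
    pct_a = 100 * len(shared) / len(vocab_a) if vocab_a else 0.0
    pct_b = 100 * len(shared) / len(vocab_b) if vocab_b else 0.0
    return shared, pct_a, pct_b


# Invented placeholder texts.
text_a = "the king fears the gods"
text_b = "the gods punish the proud king"

shared, pct_a, pct_b = shared_vocabulary(text_a, text_b)
print(sorted(shared))
print(f"{pct_a:.1f}% of text A's vocabulary, {pct_b:.1f}% of text B's")
```

Counting distinct words versus every occurrence gives different percentages; choosing between them is a scholarly decision, not a technical one.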
Example #4
• if the user is located in geographic location Z, e.g., Blue Road & S Red Road, retrieve some text from an online location
• print that text on the user’s tablet screen
• receive input from the user and respond
However...
• In Example #1, the computer is focusing on things that characters say. But what if you want to isolate speeches from just one character?
• In Example #2, how does the computer know how much text to print? Will it just print "Benedict Cumberbatch" 379 times, because that's how often it appears in the larger file?
These are the areas of programming where critical thinking and
specialized disciplinary knowledge become vital.
The Difference
• Humans are good at differentiating between material in complex and sophisticated ways.
• Computers are good at not differentiating between material unless they’ve been specifically instructed to do so.
Computers work with data.
You work with data, too -- but you may have to do extra work to make your data readable by computer.
How to make your data machine-readable
• Annotate it with markup language
• Organize it in formats or structures that the computer can understand
• Add metadata that is not explicitly readable in the current format (e.g., hardbound/softbound binding; language: English; date of record creation)
Depending on the data you have, and the way
you annotate or structure it, different things become
possible.
Your goal is to make the data As Simple As
Possible -- but not so simple that it stops being
useful.
Depending on the data you work with, the work of structuring or annotating
becomes more challenging, but also
more useful.
Many programming and markup languages have governing bodies that establish
standards for their use:
• World Wide Web Consortium (W3C) (www.w3.org/standards/)
• TEI Technical Council (www.tei-c.org)
Data Examples
• Annotated (Markup Languages: HTML, TEI)
• Formatted (JSON)
• Structured data (tabular, relational, non-relational)
• Object-Oriented Programming (Java, Python, Ruby)
Markup: HTML
<a href="http://www.paigemorgan.net">This text</a> will take you to a webpage.
= This text will take you to a webpage.
Markup: HTML
Anything can be data -- and markup languages provide instructions for how
computers should treat that data.
Markup: HTML
HTML is used to format text on webpages.
<p> separates text into paragraphs.
<em> emphasizes text (browsers usually display it in italics).
These are just a few of the HTML formatting instructions that you can use.
HTML Syntax Rules
• Opening and closing tags: <> and </>
• Attributes (2nd-level information) defined using =""
Poetry w/ TEI
<text xmlns="http://www.tei-c.org/ns/1.0" xml:id="d1">
  <body xml:id="d2">
    <div1 type="book" xml:id="d3">
      <head>Songs of Innocence</head>
      <pb n="4"/>
      <div2 type="poem" xml:id="d4">
        <head>Introduction</head>
        <lg type="stanza">
          <l>Piping down the valleys wild, </l>
          <l>Piping songs of pleasant glee, </l>
          <l>On a cloud I saw a child, </l>
          <l>And he laughing said to me: </l>
        </lg>
Grammar w/ TEI
<entry>
  <form>
    <orth>pamplemousse</orth>
  </form>
  <gramGrp>
    <gram type="pos">noun</gram>
    <gram type="gen">masculine</gram>
  </gramGrp>
</entry>
TEI’s syntax rules closely resemble HTML’s (TEI is written in XML, which is
stricter about closing every tag) -- though your normal
browser can’t work with TEI the way it works with
HTML.
Anything that you can isolate (and put in brackets) can
(theoretically) be pulled out and displayed for a reader.
TEI can be used to encode more than just text:
<div type="shot">
  <view>BBC World symbol</view>
  <sp>
    <speaker>Voice Over</speaker>
    <p>Monty Python's Flying Circus tonight comes to you live from the Grillomat Snack Bar, Paignton.</p>
  </sp>
</div>
<div type="shot">
  <view>Interior of a nasty snack bar. Customers around, preferably real people. Linkman sitting at one of the plastic tables.</view>
  <sp>
    <speaker>Linkman</speaker>
    <p>Hello to you live from the Grillomat Snack Bar.</p>
  </sp>
</div>
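Once speeches are marked up like this, pulling out one character's lines becomes a small task. Below is a minimal sketch using Python's standard xml.etree.ElementTree module. The <body> wrapper is an assumption added so the fragment has a single root element, and the function name speeches_by is hypothetical; a full TEI file would also declare the TEI namespace, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET

# The screenplay fragment, wrapped in an assumed <body> root element.
SCREENPLAY = """<body>
<div type="shot">
  <view>BBC World symbol</view>
  <sp><speaker>Voice Over</speaker>
    <p>Monty Python's Flying Circus tonight comes to you live from the Grillomat Snack Bar, Paignton.</p>
  </sp>
</div>
<div type="shot">
  <view>Interior of a nasty snack bar. Customers around, preferably real people. Linkman sitting at one of the plastic tables.</view>
  <sp><speaker>Linkman</speaker>
    <p>Hello to you live from the Grillomat Snack Bar.</p>
  </sp>
</div>
</body>"""


def speeches_by(root, name):
    """Collect the text of every <p> inside an <sp> spoken by `name`."""
    speeches = []
    for sp in root.iter("sp"):
        if sp.findtext("speaker") == name:
            speeches.extend(p.text for p in sp.findall("p"))
    return speeches


root = ET.fromstring(SCREENPLAY)
linkman_lines = speeches_by(root, "Linkman")
print(linkman_lines)
```

This only works because someone decided to tag speakers in the first place; anything left out of the markup stays invisible to the program.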
Whether you include or exclude some aspect of the text in your markup can be very important
from an academic perspective.
The challenge of creating good data is one reason that collaboration is so
important to digital scholarship.
Wise Data Collaboration
• Avoid reinventing the wheel (has someone else already created an effective method for working with this data?)
• Consider the labor involved vs. the outcome (and the future use of the data you create).
Study Scenario #1
• You study urban espresso stands: their hours, brands of coffee, whether or not they sell pastries, and how far the espresso stands are from major roadways.
Study Scenario #2
• You study female characters in novels written between 1700 and 1850. Encoding a whole novel just to study female characters isn’t practical for you.
Structured Data: Example #1 (Tabular Data)
ID  | Name            | Location                       | Hours                  | Coffee Brand         | Pastries (Y/N) | Distance from Street
008 | Java the Hut    | 56 Farringdon Road, London, UK | 7:00 a.m. - 2:00 p.m.  | Square Mile Roasters | N              | 25 meters
009 | Prufrock Coffee | 18 Shoreditch High Street      | 7:00 a.m. - 10:00 p.m. | Monmouth             | Y              | 10 meters
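To suggest what a tabular structure makes possible, here is a short Python sketch that stores the same two rows as comma-separated values (CSV) and asks two questions of them. Saving the table as CSV is an assumption, and so is renaming the distance column to "Distance (m)" and storing it as a bare number so it can be compared.

```python
import csv
import io

# The two rows from the table, as CSV text (in practice this would be a file).
TABLE = """ID,Name,Location,Hours,Coffee Brand,Pastries (Y/N),Distance (m)
008,Java the Hut,"56 Farringdon Road, London, UK",7:00 a.m.-2:00 p.m.,Square Mile Roasters,N,25
009,Prufrock Coffee,18 Shoreditch High Street,7:00 a.m.-10:00 p.m.,Monmouth,Y,10
"""

# Each row becomes a dictionary keyed by the column headers.
rows = list(csv.DictReader(io.StringIO(TABLE)))

# Which stands sell pastries?
with_pastries = [row["Name"] for row in rows if row["Pastries (Y/N)"] == "Y"]

# Which stand is closest to the street?
nearest = min(rows, key=lambda row: int(row["Distance (m)"]))

print(with_pastries)
print(nearest["Name"])
```

Keeping the unit out of the cell ("25" rather than "25 meters") is a small structuring choice that lets the computer treat the value as a number.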
Object-Oriented Programming
• Java, Python, C++, Perl, PHP, Ruby, etc.
• Widely used, highly flexible, very powerful
What’s an “object”?
• An object is a structure that contains data in one or more forms.
• Common forms include strings, integers, and arrays (groups of data).
• Example (handout)
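The handout is not reproduced in this transcript. As a stand-in, here is a small Python sketch of an object that holds a string, an integer, and an array (a list, in Python's terms). The class name EspressoStand, its fields, and the sells method are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EspressoStand:
    """One espresso stand, bundling several forms of data together."""
    name: str                                           # a string
    distance_m: int                                     # an integer
    coffee_brands: list = field(default_factory=list)   # an array of strings

    def sells(self, brand):
        """Report whether this stand carries the given brand."""
        return brand in self.coffee_brands


stand = EspressoStand("Java the Hut", 25, ["Square Mile Roasters"])
print(stand.name, stand.distance_m, stand.sells("Monmouth"))
```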
Object-oriented programming, cont’d
• Learning a bit about an OOP language can help you become accustomed to working with programming
• Reading OOP code can also be useful
• Many free tutorials are available
• Goal: to be able to converse more effectively with professional programmers, rather than become an expert yourself.
Every project has data.
Text objects, images, tags, geographical coordinates, categories, records, creator
metadata, etc.