Data and text mining: the search for unknown knowns
Geoffrey Bilder
UKSG, 2007
gbilder@crossref.org
"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."
The Mining Metaphor
Gold Mining
Diamond Mining
Data Mining
Data Mining: What it isn’t
≠ Information Retrieval
≠ Information Extraction
≠ Information Analysis
Information Retrieval + Information Extraction + Information Analysis
Data Mining: discovering new, previously unknown information
And so what is text data mining?
Text Mining
Information Retrieval + Information Extraction + Information Analysis
The crucial question for publishers is: if “hiding” information in unstructured text is a problem, then shouldn’t we be exploring new ways to “publish”?
So how did we get here?
• The word tobacco originates from the Taino Indians.
• There is no I in the word Team.
• The book captured the zeitgeist of the time.
• I am sure that I turned the gas off.
The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.
I am <emphasis>sure</emphasis> that I turned the gas off.
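
Once the meaning of a span is marked up explicitly, a program can recover it without guessing from typography. The following is a minimal sketch of that idea, assuming the two sentences above are wrapped in a hypothetical `<doc>`/`<s>` structure (those wrapper elements and the function name are illustrative and not part of the original slides):

```python
import xml.etree.ElementTree as ET

# The two marked-up sentences from the slide.
# The <doc> and <s> wrapper elements are hypothetical, added only so the text parses as one XML document.
SAMPLE = """<doc>
<s>The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.</s>
<s>I am <emphasis>sure</emphasis> that I turned the gas off.</s>
</doc>"""


def extract_marked_spans(xml_text):
    """Return (tag, lang attribute or None, text) for every marked-up span inside a sentence."""
    root = ET.fromstring(xml_text)
    spans = []
    for sentence in root.iter("s"):
        for child in sentence:  # direct children are the semantically tagged spans
            spans.append((child.tag, child.attrib.get("lang"), child.text))
    return spans


print(extract_marked_spans(SAMPLE))
# [('foreign_phrase', 'DE', 'zeitgeist'), ('emphasis', None, 'sure')]
```

With plain italics, both spans would look identical to a machine; with the markup, the foreign phrase and the emphasis can be told apart.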
Semantic Web “Light”
But we can do more...
The web as a database

| Title | Author | ISBN-13 | Publisher |
| --- | --- | --- | --- |
| Labyrinths | Jorge Luis Borges | 978-0811200127 | New Directions |
| Hopscotch | Julio Cortazar | 978-0394752846 | Pantheon |
| The Aleph | Jorge Luis Borges | 978-0140286809 | Penguin |
| ... | ... | ... | ... |

The Relational Model
Rows represent things
Columns are properties
The book has an author “Jorge Luis Borges”

| Subject | Predicate | Object |
| --- | --- | --- |
| The book (the thing) | has an author (the thing’s property) | “Jorge Luis Borges” |

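
The step from table to statements can be sketched in a few lines. This is a rough illustration only: the choice of the ISBN-13 column as the identifier for each "thing", and the names `BOOKS` and `rows_to_triples`, are assumptions made for the example rather than anything stated on the slides.

```python
# Rows from the example table; the dict keys mirror its column headers.
BOOKS = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "978-0394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0140286809", "Publisher": "Penguin"},
]


def rows_to_triples(rows, key="ISBN-13"):
    """Emit one (subject, predicate, object) triple per cell.

    Each row is a thing (the subject, identified here by its key column),
    each remaining column is a property (the predicate),
    and the cell value is the object.
    """
    triples = []
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples


for triple in rows_to_triples(BOOKS):
    print(triple)
```

Three rows with three non-key columns each yield nine triples, one of which is `("978-0140286809", "Author", "Jorge Luis Borges")`, the statement shown on the slide.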
The book has an author “Jorge Luis Borges”

| Subject | Predicate | Object |
| --- | --- | --- |
| The book | has an author | “Jorge Luis Borges” |
| URI | | URI |

http://www.amazon.com/isbn/978-0140286809
has an author
http://www.wikipedia.com/borges
RDF: Resource Description Framework
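
As a rough illustration, the statement above can be written out in N-Triples, one of the plain-text RDF serializations. The slide leaves the predicate as the phrase "has an author"; RDF expects a URI there as well, so this sketch assumes the Dublin Core `creator` property as a stand-in. That choice, and the function name, are assumptions and not something the slides specify.

```python
# Subject and object URIs as shown on the slide.
BOOK_URI = "http://www.amazon.com/isbn/978-0140286809"
AUTHOR_URI = "http://www.wikipedia.com/borges"

# Assumption: Dublin Core "creator" stands in for the slide's "has an author".
HAS_AUTHOR_URI = "http://purl.org/dc/elements/1.1/creator"


def to_ntriple(subject_uri, predicate_uri, object_uri):
    """Format one statement whose three parts are all URIs as an N-Triples line."""
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


print(to_ntriple(BOOK_URI, HAS_AUTHOR_URI, AUTHOR_URI))
```

Because every part is a URI, anyone else publishing statements about the same book or the same author can use the same identifiers, and the statements can be merged.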
[Diagram: Journal A, Journal B, Wiki, Blog, Personal Website, OPAC]
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name
SPARQL
http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss
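
To make the query's meaning concrete, here is a pure-Python sketch of what it asks for: the distinct names of everything typed as a `foaf:Person`, in sorted order. It is not a SPARQL engine, and the sample triples (including the `example.org` URIs) are invented for illustration. Python's string sort is used as an approximation of `ORDER BY`.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_ORGANIZATION = "http://xmlns.com/foaf/0.1/Organization"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical sample data; the example.org URIs are placeholders.
TRIPLES = [
    ("http://example.org/people/cortazar", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/people/cortazar", FOAF_NAME, "Julio Cortazar"),
    ("http://example.org/people/borges", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/people/borges", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/orgs/penguin", RDF_TYPE, FOAF_ORGANIZATION),
    ("http://example.org/orgs/penguin", FOAF_NAME, "Penguin"),
]


def person_names(triples):
    """Mimic the query: match ?x rdf:type foaf:Person, then ?x foaf:name ?name."""
    # First pattern: every subject that is typed as a foaf:Person.
    people = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # Second pattern: names of those subjects; the set gives DISTINCT.
    names = {o for s, p, o in triples if p == FOAF_NAME and s in people}
    # ORDER BY ?name, approximated with Python's default string ordering.
    return sorted(names)


print(person_names(TRIPLES))
# ['Jorge Luis Borges', 'Julio Cortazar']
```

The organization has a name too, but it is filtered out because it does not match the first pattern.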
RSS 1.0
FRBR
Creative Commons
FOAF
Geo
SKOS
The Early Modern Internet
Data Mining = Information retrieval + Information extraction + Information analysis... with the goal of discovering new, previously unknown information
Data Mining = Information retrieval + Information extraction + Information analysis... with the goal of discovering new, previously unknown information
Text Data Mining = Complex data extraction layer + data mining
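
A toy sketch may help show the two layers side by side. The sentences, the regular expression, and the function names below are all invented for illustration. A real extraction layer is far more complex than one pattern, and grouping titles by author merely stands in for the analysis step.

```python
import re
from collections import defaultdict

# Hypothetical unstructured input; only the titles and authors come from the example table.
SENTENCES = [
    "Labyrinths was written by Jorge Luis Borges.",
    "Hopscotch was written by Julio Cortazar.",
    "The Aleph was written by Jorge Luis Borges.",
    "The weather was pleasant.",
]

# Extraction layer: a single hand-written pattern standing in for real text processing.
PATTERN = re.compile(r"^(?P<title>.+) was written by (?P<author>.+)\.$")


def extract(sentences):
    """Turn matching sentences into structured (title, author) facts; skip the rest."""
    facts = []
    for sentence in sentences:
        match = PATTERN.match(sentence)
        if match:
            facts.append((match.group("title"), match.group("author")))
    return facts


def titles_by_author(facts):
    """Analysis step over the structured facts: group titles under each author."""
    grouped = defaultdict(list)
    for title, author in facts:
        grouped[author].append(title)
    return dict(grouped)


print(titles_by_author(extract(SENTENCES)))
# {'Jorge Luis Borges': ['Labyrinths', 'The Aleph'], 'Julio Cortazar': ['Hopscotch']}
```

The analysis function never touches raw text. If the same facts were published as structured data in the first place, the extraction layer would not be needed at all.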
Why do we publish text?
Thank you
gbilder@crossref.org