CS4624S13P- Environment/VWRRC BEN KATZ ERIC HOTINGER BLACKSBURG, VIRGINIA. CLASS: CS VIRGINIA TECH,...
CS4624S13P-Environment/VWRRC
BEN KATZ ([email protected])
ERIC HOTINGER ([email protected])
BLACKSBURG, VIRGINIA. CLASS: CS 4624 @ VIRGINIA TECH, CLIENT: VWRRC, DATE CREATED: 5/1/2013.
Summary of Work
Document Extraction
Document Parsing
Website Parsing
VTechWorks Configuration
VTechWorks Upload
VWRRC Website Advice
Document Extraction
Extracted 394 documents from the Virginia Water Resources Research Center (VWRRC) website using DownThemAll
Conference Proceedings, Bulletins, Special / Educational Reports, and Newsletters dating back to the 1970s
Document Extraction (cont.)
Document Parsing
Parsed each PDF document for tags
Apache PDFBox for PDF -> Text conversion
OpenCloud for generation of tags
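The tag-generation step can be illustrated with a minimal sketch. It assumes the PDF text has already been extracted, and it does not call PDFBox or OpenCloud: plain word counting stands in for OpenCloud's tag weighting. The class name `TagSketch`, the method `topTags`, and the stop-word list are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TagSketch {
    // Tiny illustrative stop-word list (an assumption, not the project's list).
    private static final Set<String> STOP_WORDS =
            Set.of("the", "and", "for", "with", "from", "that", "this");

    // Returns up to `limit` words, most frequent first; ties break alphabetically.
    static List<String> topTags(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            // Skip empty tokens, very short words, and stop words.
            if (word.length() < 3 || STOP_WORDS.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (tags.size() == limit) {
                break;
            }
            tags.add(entry.getKey());
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "Water quality in the river. Water supply and river water.";
        System.out.println(topTags(text, 2));
    }
}
```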
Document Parsing: Output
Website Parsing
Parsed website to obtain metadata about each publication
Used JSoup along with regular expressions (Pattern class in Java) to alleviate the pain of parsing HTML
Involved splitting a list of authors like “Bob and Jane” by the regexp “and” to obtain an author list with “Bob” as the first element and “Jane” as the second element
Simple example, but the real data required more complicated regexps because it was non-uniform
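A minimal sketch of the author-splitting idea, using only Java's Pattern class. The separator set (commas, semicolons, "&", and the word "and") and the names `AuthorSplitter` and `splitAuthors` are assumptions for illustration, not the project's actual expressions. The word boundaries around "and" matter: without them, a name such as "Sandy Anderson" would be split in the middle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AuthorSplitter {
    // Separators: comma, semicolon, ampersand, or the standalone word "and".
    // \b keeps "and" from matching inside names like "Sandy" or "Anderson".
    private static final Pattern SEPARATOR =
            Pattern.compile("\\s*(?:,|;|&|\\band\\b)\\s*", Pattern.CASE_INSENSITIVE);

    static List<String> splitAuthors(String raw) {
        List<String> authors = new ArrayList<>();
        for (String part : SEPARATOR.split(raw.trim())) {
            // Adjacent separators such as ", and" leave empty pieces; drop them.
            if (!part.isEmpty()) {
                authors.add(part);
            }
        }
        return authors;
    }

    public static void main(String[] args) {
        System.out.println(splitAuthors("Bob and Jane"));
        System.out.println(splitAuthors("Bob, Jane, and Sandy Anderson"));
    }
}
```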
VTechWorks Configuration
Programmatically generated XML configuration documents for each publication, in preparation for upload to VTechWorks
Involved cleaning of titles and citations to fit VTechWorks quality assurance requirements
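A minimal sketch of generating one such XML document. The slides do not show the actual schema, so this assumes a layout resembling the `dublin_core.xml` file of DSpace's Simple Archive Format; the class `DublinCoreWriter`, its methods, and the whitespace-only title cleaning are hypothetical stand-ins for the project's own code.

```java
import java.util.List;

class DublinCoreWriter {
    // Escapes the characters that are special in XML text and attribute values.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Assumed cleaning rule: trim and collapse runs of whitespace.
    static String cleanTitle(String title) {
        return title.trim().replaceAll("\\s+", " ");
    }

    // Builds the metadata document for one publication.
    static String toXml(String title, List<String> authors) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<dublin_core>\n");
        xml.append("  <dcvalue element=\"title\" qualifier=\"none\">")
           .append(escapeXml(cleanTitle(title)))
           .append("</dcvalue>\n");
        for (String author : authors) {
            xml.append("  <dcvalue element=\"contributor\" qualifier=\"author\">")
               .append(escapeXml(author))
               .append("</dcvalue>\n");
        }
        xml.append("</dublin_core>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(toXml("  Water   & Land ", List.of("Bob", "Jane")));
    }
}
```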
VTechWorks Configuration (cont.)
VTechWorks Upload Preparation
Sent upload package (.zip) to a library staffer, who verified our upload and sent it to VTechWorks for processing/QA
Some bug fixing involved: had to add a contents file, which contains a list of all PDFs to be uploaded in a particular set
Renamed directories to integers to make exports work from VTechWorks
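The two fixes above can be sketched as pure functions that compute what would be written or renamed. The one-filename-per-line layout of the contents file, the numbering starting at 1 in sorted order, and the names `UploadPackager`, `contentsLines`, and `integerNames` are assumptions; writing to disk is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class UploadPackager {
    // Lines of a contents file: every PDF in the set, sorted, one per line.
    static List<String> contentsLines(List<String> fileNames) {
        List<String> pdfs = new ArrayList<>();
        for (String name : fileNames) {
            if (name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                pdfs.add(name);
            }
        }
        Collections.sort(pdfs);
        return pdfs;
    }

    // Maps each directory name to an integer name: "1", "2", ... in sorted order.
    static Map<String, String> integerNames(List<String> dirNames) {
        List<String> sorted = new ArrayList<>(dirNames);
        Collections.sort(sorted);
        Map<String, String> renames = new LinkedHashMap<>();
        int index = 1;
        for (String dir : sorted) {
            renames.put(dir, Integer.toString(index++));
        }
        return renames;
    }

    public static void main(String[] args) {
        System.out.println(contentsLines(List.of("b.pdf", "notes.txt", "a.PDF")));
        System.out.println(integerNames(List.of("proceedings", "bulletins")));
    }
}
```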
Website Improvements: The Old
Website Improvements: The New
Lessons Learned
Dirty data is difficult to manage
Communication is important
Stick to your timeline
Water links
Questions?