Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012
![Page 1: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/1.jpg)
SCAPE
Sven Schlarb, Austrian National Library
Keeping Control: Scalable Preservation Environments for Identification and Characterisation
Guimarães, Portugal, 07/12/2012
Large scale preservation workflows with Taverna
![Page 2: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/2.jpg)
What do you mean by "workflow"?
• Data flow rather than control flow
• (Semi-)automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share
![Page 3: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/3.jpg)
Modularise complex preservation tasks
• Assuming that complex preservation tasks can be separated into processing steps
• Together the steps represent the automated processing pipeline
• Example steps: Migrate, Characterise, Quality Assurance, Ingest
![Page 4: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/4.jpg)
Experimental workflow development
• Easy to execute a workflow on standard platforms from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity
![Page 5: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/5.jpg)
Taverna
• Workflow language and computational model for creating composite data-intensive processing chains
• Developed since 2004 as a tool for life scientists and bio-informaticians by myGrid, University of Manchester, UK
• Available for Windows/Linux/OSX and as open source (LGPL)
![Page 6: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/6.jpg)
SCUFL/T2FLOW/SCUFL2
• Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)
• SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API
• SCUFL2 will replace the t2flow format (which replaced the SCUFL format)
• Adopts Linked Data technology
![Page 7: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/7.jpg)
Creating workflows using Taverna
• Users interactively build data processing pipelines
• Set of nodes represents data processing elements
• Nodes are connected by directed edges and the workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows
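The directed-graph idea can be illustrated with a few lines of Python. This is only a sketch of data-flow execution, not how Taverna is implemented: the node names, the placeholder processors and the `run_workflow` helper are all hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical processing units: each receives the outputs of its upstream nodes.
def migrate(upstream):
    return upstream["source"].replace(".tif", ".jp2")

def characterise(upstream):
    return {"file": upstream["migrate"], "format": "jp2"}

def quality_assurance(upstream):
    return upstream["characterise"]["format"] == "jp2"

# Directed edges, written as: node -> set of nodes it depends on.
DEPENDENCIES = {
    "migrate": {"source"},
    "characterise": {"migrate"},
    "quality_assurance": {"characterise"},
}
PROCESSORS = {
    "migrate": migrate,
    "characterise": characterise,
    "quality_assurance": quality_assurance,
}

def run_workflow(source_value):
    """Run every node once all of its upstream data is available."""
    results = {"source": source_value}  # the workflow input port
    for node in TopologicalSorter(DEPENDENCIES).static_order():
        if node in PROCESSORS:
            upstream = {dep: results[dep] for dep in DEPENDENCIES[node]}
            results[node] = PROCESSORS[node](upstream)
    return results
```

Calling `run_workflow("page.tif")` returns the output of every node, keyed by node name. The order of execution follows from the edges alone, which is the sense in which a workflow is data flow rather than control flow.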
![Page 8: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/8.jpg)
Processors
• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages)
• Remote shell script invocations via ssh (Tool)
• XML splitters - XSLT (interoperability!)
![Page 9: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/9.jpg)
List handling: Implicit iteration over multiple inputs
• A "single value" input port (list depth 0) that receives a list processes its values one at a time (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross product or dot product
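The difference between the two combination strategies can be shown with plain Python lists. This is a sketch only: the helper names, the `describe` processor and the sample values are hypothetical, and the result is returned as a flat list for brevity, whereas Taverna keeps the list nesting in its outputs.

```python
from itertools import product

def cross_product(*ports):
    """Every value of one port combined with every value of the others."""
    return list(product(*ports))

def dot_product(*ports):
    """Values combined pairwise by position (first with first, and so on)."""
    return list(zip(*ports))

def invoke(processor, combine, *ports):
    """Implicit iteration: call the processor once per combined tuple."""
    return [processor(*args) for args in combine(*ports)]

# Hypothetical processor with two single-value input ports.
def describe(file_name, tool):
    return f"{tool}:{file_name}"

files = ["a.jp2", "b.jp2"]
tools = ["exiftool", "tika"]
```

With these inputs, `invoke(describe, cross_product, files, tools)` calls the processor four times, while `invoke(describe, dot_product, files, tools)` calls it twice.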
![Page 10: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/10.jpg)
Example: Tika Preservation Component
• Input: "file"
• Processor: Tika web service (SOAP)
• Output: Mime-Type
![Page 11: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/11.jpg)
Workflow development and execution
• Local development: Taverna Workbench
![Page 12: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/12.jpg)
Workflow registry
• Web 2.0 style registry: myExperiment
![Page 13: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/13.jpg)
Remote Workflow Execution
• Web client using REST API of Taverna Server
![Page 14: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/14.jpg)
Hadoop
• Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)
• Hadoop = MapReduce + HDFS
• HDFS: Distributed file system, data stored in 64 MB (default) blocks
![Page 15: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/15.jpg)
Hadoop
• Job tracker (master) manages job execution on task trackers (workers)
• Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)
• Name node manages HDFS, i.e. distribution of data blocks on data nodes
![Page 16: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/16.jpg)
Hadoop job building blocks
Map/reduce application (JAR):
• Job configuration: Set or overwrite configuration parameters.
• Map method: Create intermediate key/value pair output
• Reduce method: Aggregate intermediate key/value pair output from map
![Page 17: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/17.jpg)
Cluster
![Page 18: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/18.jpg)
Large scale execution environment
[Figure: Apache Tomcat web application, Taverna Server (REST API), Hadoop Jobtracker, file server, cluster]
![Page 19: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/19.jpg)
Example: Characterisation on a large document collection
• Using "Tool" service, remote ssh execution
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive)
• Available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
![Page 20: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/20.jpg)
Create text file containing JPEG2000 input file paths and read Image metadata using Exiftool via the Hadoop Streaming API.
![Page 21: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/21.jpg)
Reading image metadata
Workflow steps: Jp2PathCreator (find), then HadoopStreamingExiftoolRead
Collection: 60,000 books, 24 million pages, files read from NAS
Runtime: ~5 h + ~38 h = ~43 h
Jp2PathCreator output (1.4 GB):
/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2
/NAS/Z117655409/00000002.jp2
/NAS/Z117655409/00000003.jp2
…
/NAS/Z119585987/00000001.jp2
/NAS/Z119585987/00000002.jp2
/NAS/Z119585987/00000003.jp2
…
/NAS/Z119584539/00000001.jp2
/NAS/Z119584539/00000002.jp2
/NAS/Z119584539/00000003.jp2
…
/NAS/Z119599879/00000001.jp2
/NAS/Z119589879/00000002.jp2
/NAS/Z119589879/00000003.jp2
...
HadoopStreamingExiftoolRead output (1.2 GB):
Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
Z117655409/00000001 2300
Z117655409/00000002 2300
Z117655409/00000003 2345
…
Z119585987/00000001 2300
Z119585987/00000002 2340
Z119585987/00000003 2432
…
Z119584539/00000001 5205
Z119584539/00000002 2310
Z119584539/00000003 2134
…
Z119599879/00000001 2312
Z119589879/00000002 2300
Z119589879/00000003 2300
...
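A Hadoop Streaming mapper for this step could look roughly like the following Python sketch. It is not the script used in the workflow: the function names are hypothetical, and it assumes that Exiftool is installed on every worker node and that the value of interest is the `ImageWidth` tag.

```python
import subprocess
import sys

def page_id(path):
    """'/NAS/Z119585409/00000001.jp2' -> 'Z119585409/00000001'."""
    book, filename = path.strip().split("/")[-2:]
    return f"{book}/{filename.rsplit('.', 1)[0]}"

def exiftool_width(path):
    # Assumption: exiftool is on the PATH; -s3 prints only the tag value.
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def run_mapper(lines, read_width=exiftool_width):
    """Turn input lines (one file path each) into 'key<TAB>value' records."""
    for line in lines:
        path = line.strip()
        if path:
            yield f"{page_id(path)}\t{read_width(path)}"

def main():
    # Hadoop Streaming feeds the input split on stdin and reads stdout.
    for record in run_mapper(sys.stdin):
        print(record)
```

A script used as the streaming mapper would call `main()`. The `read_width` parameter exists only so that the record format can be checked without Exiftool, for example `list(run_mapper(["/NAS/Z119585409/00000001.jp2"], read_width=lambda p: "2345"))`.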
![Page 22: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/22.jpg)
Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS.
![Page 23: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/23.jpg)
SequenceFile creation
Workflow steps: HtmlPathCreator (find), then SequenceFileCreator
Collection: 60,000 books, 24 million pages, files read from NAS
Runtime: ~5 h + ~24 h = ~29 h
HtmlPathCreator output (1.4 GB):
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...
SequenceFileCreator output (997 GB uncompressed), keys:
Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712
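The key/value layout of such a file can be sketched in Python. Note that this is a simplified, hypothetical container (length-prefixed records), not the Hadoop SequenceFile binary format, and the function names are assumptions. The point it illustrates is that many small files become records in one large file, which suits HDFS better than millions of small files.

```python
import struct
from pathlib import Path

def records(paths):
    """Yield (key, content) pairs, with keys like 'Z119585409/00000707'."""
    for path in paths:
        p = Path(path)
        yield f"{p.parent.name}/{p.stem}", p.read_bytes()

def write_container(paths, out_path):
    """Append all records to one file; return the number of records written."""
    count = 0
    with open(out_path, "wb") as out:
        for key, content in records(paths):
            key_bytes = key.encode("utf-8")
            # Two big-endian 32-bit lengths, then the key, then the file content.
            out.write(struct.pack(">II", len(key_bytes), len(content)))
            out.write(key_bytes)
            out.write(content)
            count += 1
    return count
```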
![Page 24: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/24.jpg)
Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width.
![Page 25: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/25.jpg)
HTML Parsing
Workflow step: HadoopAvBlockWidthMapReduce (input: SequenceFile, output: text file)
Collection: 60,000 books, 24 million pages
Runtime: ~6 h
Input keys:
Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005
...
Map output:
Z119585409/00000001 2100 Z119585409/00000001 2200 Z119585409/00000001 2300 Z119585409/00000001 2400
Z119585409/00000002 2100 Z119585409/00000002 2200 Z119585409/00000002 2300 Z119585409/00000002 2400
Z119585409/00000003 2100 Z119585409/00000003 2200 Z119585409/00000003 2300 Z119585409/00000003 2400
Z119585409/00000004 2100 Z119585409/00000004 2200 Z119585409/00000004 2300 Z119585409/00000004 2400
Z119585409/00000005 2100 Z119585409/00000005 2200 Z119585409/00000005 2300 Z119585409/00000005 2400
Reduce output:
Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250
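The map and reduce logic of this step can be mimicked in memory with a short Python sketch. It is not the Java job from the workflow: the function names are hypothetical, and it assumes the block widths have already been extracted from each HTML page.

```python
from collections import defaultdict

def map_page(page_id, block_widths):
    """Map: emit one (page id, width) pair per paragraph block."""
    for width in block_widths:
        yield page_id, width

def reduce_page(page_id, widths):
    """Reduce: average all widths that share the same page id."""
    widths = list(widths)
    return page_id, sum(widths) // len(widths)

def run_job(pages):
    """Group the map output by key, then reduce each group."""
    grouped = defaultdict(list)
    for page_id, block_widths in pages.items():
        for key, value in map_page(page_id, block_widths):
            grouped[key].append(value)
    return dict(reduce_page(key, values) for key, values in sorted(grouped.items()))
```

For a page with block widths 2100, 2200, 2300 and 2400, `run_job` yields an average of 2250, matching the sample values on the slide.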
![Page 26: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/26.jpg)
Create the Hive tables and load the generated data into the Hive database.
![Page 27: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/27.jpg)
Analytic Queries
Workflow steps: HiveLoadExifData & HiveLoadHocrData
Collection: 60,000 books, 24 million pages
Runtime: ~6 h
htmlwidth:
Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700
jp2width:
Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250
CREATE TABLE jp2width (jid STRING, jwidth INT)
CREATE TABLE htmlwidth (hid STRING, hwidth INT)
![Page 28: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/28.jpg)
Analytic Queries
Workflow step: HiveSelect
Collection: 60,000 books, 24 million pages
Runtime: ~6 h
Tables: htmlwidth, jp2width
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
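The join can be tried out without a Hive installation. The sketch below uses Python's built-in SQLite as a stand-in for Hive, so the column types differ (TEXT/INTEGER instead of STRING/INT); the `join_widths` helper is hypothetical, and the key column of jp2width is named `jid` to match the query.

```python
import sqlite3

def join_widths(jp2_rows, html_rows):
    """Load both tables into an in-memory database and run the join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
    con.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")
    con.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)
    con.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)
    rows = con.execute(
        "SELECT jid, jwidth, hwidth FROM jp2width "
        "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
    ).fetchall()
    con.close()
    return rows
```

Each result row pairs the image width with the average text block width of the same page, so pages where the two values diverge strongly can be picked out with a further filter. Pages that appear in only one of the tables are dropped by the inner join.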
![Page 29: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/29.jpg)
Run a simple Hive query to test whether the database has been created successfully.
![Page 30: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/30.jpg)
Example: Web Archiving
![Page 31: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/31.jpg)
Hands on – Virtual machine
• Pseudo-distributed Hadoop configuration (0.20.2+923.421)
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
• Latest patches
![Page 32: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/32.jpg)
Hands on – VM setup
• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => Browse to folder => select VBOX file
• VM instance login:
  • user: scape
  • pw: scape123
![Page 33: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/33.jpg)
Hands on – Demo1
• Using Hadoop for analysing ARC files
• Located at: /example/sampleIN/ (HDFS)
• Execution via command in: SampleHadoopCommand.txt (on Desktop)
• Result can then be found at: /example/sample_OUT/
![Page 34: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012](https://reader031.fdocuments.net/reader031/viewer/2022020306/54833cefb07959520c8b49df/html5/thumbnails/34.jpg)
Hands on – Demo2
• Using Taverna for analysing ARC files
• Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
• ADD FILE LOCATION (not add value!)
• Input: /home/scape/scanARC/input/ONBSample.txt
• Result: ~/scanARC/outputCSV/fullTIKAReport.csv
• See ~/scanARC/outputGraphics/ graphicsTIKA/tika-