Text mining tools for semantically enriching scientific literature
UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation
-
Upload
claire-lamb -
Category
Documents
-
view
32 -
download
0
description
Transcript of UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation
UK e-Science
Future Infrastructure for Scientific DataMining, Integration and Visualisation
Malcolm Atkinson
Director of National e-Science Centrewww.nesc.ac.uk
25th October 2002
SDMIV workshop, e-Science InstituteEdinburgh
Overview
UK e-ScienceReminder of Investment and Infrastructure
International e-ScienceExamples and Collaboration
Data Access and IntegrationLego Bricks for Scientific Application DevelopersTailored: Application and Computing Scientists
A Computer Scientist’s Christmas ListDiversity and Opportunity
The Way Ahead
e-Science
Fundamentally about CollaborationSharing
Ideas Thought processes and Stimuli Effort Resources
Requires Communication Common understanding & Framework Mechanisms for sharing fairly Organisation and Infrastructure
Scientists (Biologists) have done this for Centuries
e-Science (take 2)
Fundamentally about CollaborationSharing
Ideas Thought processes and Stimuli Effort Resources
Requires Communication Common understanding & Framework Mechanisms for sharing fairly Organisation and Infrastructure
Text, digital media, structured, organised & curated data, computable
models, visualisation, shared instruments, shared systems,
shared administration, …
Nationally & Internationally Distributed, …
Routine, Daily, Automated, …
That Requires very Significant Investment in DigitalSystems and their Support
e-Science (take 3)
Fundamentally about CollaborationSharing
Ideas Thought processes and Stimuli Effort Resources
Requires Communication Common understanding & Framework Mechanisms for sharing fairly Organisation and Infrastructure
Digital networks, digital work-places, digital
instruments, …
Metadata, ontologies, standards, shared curated
data, shared codes, …
Common platforms, shared software, shared training, …
The Grid SHOULD make this much easier byproviding a common, supported high-level of Software and Organisational infrastructure
Citation, Authentication, Authorisation, Accounting,
Provenance, Policies, …
Shared Provision of Platform,
Grid ExpectationsPersistence
Always there, Always Working, Always Supported
StabilityYou can build on foundations that don’t move
Trustworthy & PredictableHonours commitments
Digital policies, digital contracts, security, … Data integrity, longevity and accessibility Performance
High-level & ExtensibleThe capabilities you need are already there
UbiquitousYour collaborators use it
Grid RealityPersistence
Always there, Always Working, Always Supported
StabilityYou can build on foundations that don’t move
Trustworthy & PredictableHonours commitments
Digital policies, digital contracts, security, … Data integrity, longevity and accessibility Performance
High-level & ExtensibleThe capabilities you need are already there
UbiquitousYour collaborators use it
Political, Economic & Technical issues to Solve
Early days but Open Grid Services link with Web
Services + GGF standardisation
Not yet but very substantialglobal effort to achieve this
Good basis for extensionCommitment to basic functionality
WS + Community effort
Global & Industrial Rallying CryMust work with Web Services
Cambridge
Newcastle
Edinburgh
Oxford
Glasgow
Manchester
Cardiff
Southampton
London
Belfast
Daresbury Lab
RALHinxton
UK Grid Network
Nationale-
ScienceCentre
Access Grid always-on video always-on video wallswalls
HPC(x)
Scotland via Glasgow
NNW
Northern Ireland
MidMAN
TVN
South WalesMAN
SWAN&BWEMAN
WorldComGlasgow
WorldComEdinburgh
WorldComManchester
WorldComReading
WorldComLeeds
WorldComBristol
WorldComLondon
WorldComPortsmouth
Scotland via Edinburgh
YHMAN
NorMAN
EMMAN
EastNet
External Links
LMN
KentishMANLeNSE
10Gbps
622Mbps155Mbps
SuperJanet4, June 2002 20Gbps
2.5Gbps
Tony Hey July 2001
National e-Science Centre
EventsWorkshopsResearch MeetingsInternational Meetings
History of EventsGGF5HPDC11Summer school> 50 workshops held> 1000 people in totalMany return often
Planned Events25 workshops Conferences to 2005
Visitors3 arrived4 arranged
International collaboration, visits & visitors
ChinaArgonne National LabSDSCNCSA…
Centre ProjectsPilot ProjectsRegional SupportResearch Projects
EPSRC, MRC, WT, SHEFC
UCSF
UIUC
From Klaus Schulten, Center for Biomollecular Modeling and Bioinformatics, Urbana-Champaign
DataGrid Testbed
Dubna
Moscow
RAL
Lund
Lisboa
Santander
Madrid
Valencia
Barcelona
Paris
Berlin
LyonGrenoble
Marseille
BrnoPrague
Torino
Milano
BO-CNAFPD-LNL
Pisa
Roma
Catania
ESRIN
CERN
HEP sites
ESA sites
IPSL
Estec KNMI
(>40)
[email protected] - [email protected]
Testbed Sites
A Simplified Grid Anatomy
Grid Plumbing & Security Infrastructure
Scheduling Accounting Authorisation
Monitoring Diagnosis Logging
Scientific Application
Data & Compute ResourcesOperationsTeam
ApplicationDevelopers
Distributed
Owners
Scientific Users
The Crux
Grid Plumbing & Security Infrastructure
Scheduling Accounting Authorisation
Monitoring Diagnosis Logging
Scientific Application
Data & Compute ResourcesOperationsTeam
ApplicationDevelopers
Distributed
Owners
Scientific Users
Keep all the (pink)groupsHAPPY
A SDMIV Grid Anatomy
Grid Plumbing & Security Infrastructure
Scheduling Accounting Authorisation
Monitoring Diagnosis Logging
Scientific Application
Data & Compute Resources
Distributed
SDMIV Users
Data Access
Data Integration
Structured DataData ProvidersData Curators
Data Mining:Science vs Commerce
Data in files FTP a local copy /subset.ASCII or Binary.Each scientist builds own analysis toolkit Analysis is tcl script of toolkit on local data.Some simple visualization tools: x vs y
Data in a database
Standard reports for standard things.Report writers for non-standard thingsGUI tools to explore data.
Decision treesClusteringAnomaly finders
Jim Gray UCSC April 2002
But…some science is hitting a wallFTP and GREP are not adequate
You can GREP 1 MB in a secondYou can GREP 1 GB in a minute You can GREP 1 TB in 2 daysYou can GREP 1 PB in 3 years.
Oh!, and 1PB ~10,000 disks
At some point you need indices to limit searchparallel data search and
analysisThis is where databases can help
You can FTP 1 MB in 1 secYou can FTP 1 GB / min (= 1 $/GB)
… 2 days and 1K$… 3 years and 1M$
Jim Gray UCSC April 2002
50,000 Kg250 KW60 Racks = 120m2
Web Services Grid Technology
Grid Services
OGSA & OGSI
www.gridforum.org/ogsi-wgwww.gridforum.org/ogsa-wgwww.gridforum.org/
Web ServicesRapid Integration
Dynamic binding
Commercial PowerFinancial & Political
IndependenceClient from ServiceService from Client
SeparationFunction from Delivery
DescriptionWSDL, WSC, WSEF, …
Tools & PlatformsJava ONE, Visual .NETWebSphere, Oracle, …
www. w3c. org / TR / SOAP or TR/wsdl
Grid TechnologyVirtual Organisations
Sharing & CollaborationSecurity
Single Sign in, delegationDistribution & fast FTP
But Various ProtocolsResource Mangement
DiscoveryProcess CreationSchedulingMonitoring
PortabilityUbiquitous APIs & Modules
Gov’nm’t Agency Buy inIndustrial Buy in
Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001 http://www.gridforum.org/ogsi-wg
Open Grid Services Architecture
Virtual Grid Services
Applications
Multiple implementations of Grid Services
Using operations
Implemented by
OGS infrastructureFoster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration
Scientific Data
Deluge of DataExponential growth
Doubling timesAstronomy 12 monthsBio-Sequences 9 monthsFunctional Genomics 6 monthsBytes/dollar 12 to 18 months
Not How big it is but
Scientific Data
Deluge of DataExponential growth
Doubling timesAstronomy 12 monthsBio-Sequences 9 monthsFunctional Genomics 6 monthsBytes/dollar 12 to 18 months
Not How big it is butWhat you do with it
SharingCurationMetadataAutomated movement, access & integrationComputational Access
Scientific Data
Deluge of DataExponential growth
Doubling timesAstronomy 12 monthsBio-Sequences 9 monthsFunctional Genomics 6 monthsBytes/dollar 12 to 18 months
Not How big it is butHow you Embrace & Manage Change
The Database is a Knowledge chestThe Database is a Communication HubAutonomously Managed (Curated) changeAn Essential part of e-BioMedical, Astronomical, …, Science & Engineering
Wellcome Trust: Cardiovascular Functional Genomics
Glasgow Edinburgh
Leicester
Oxford
LondonNetherlands
Shared dataPublic curated
data
Data Access & Integration
Central to e-ScienceAstronomy, Earth Sciences, Ecology, Biology, Medicine, …
Collaboration Shared Databases Curated Knowledge Accumulated Observations Accumulated Simulations
Computation Data mining Input to models Calibration of models
Presentation Publication of results Visualisation
GGF DAIS WGChairs
Norman Paton (Manchester Uni.)Leanne Guy (CERN)Dave Pearson (Oracle UK)
ActivityBoF GGF4 TorontoWG Meeting GGF5 EdinburghPapers for GGF6Workshops & Mail lists
GoalsAgree Standards for Database Access & IntegrationFreely available reference implementations
OGSA-DAI one source & focus for discussions
Norman Paton,Inderpal Narang,
Leanne Guy, Susan Maliaka, Greg Ricardi, …
http://www.cs.man.ac.uk/grid-db/
OGSA-DAI project
Lego kit for Data Access & IntegrationComponents for e-Science Applications
Accelerated Application DevelopmentMultiple Data Models
Distributed DataAccess via Grid & Proxies
Integration, Translation & Transformation
Open Source Reference Implementation
For DAIS-WG standard
Trigger for Component ConstructionStart a community
Oxford
Glasgow
Cardiff
Southampton
London
Belfast
Daresbury Lab
RAL
OGSA-DAI Partners
EPCC & NeSC
Newcastle
IBMUSA
IBM Hursley
Oracle
Manchester
EPCC & NeSCIBM UKIBM USAManchester e-SCNewcastle e-SCOracle £3 million, 18 months, started February 2002
Cambridge
Hinxton
Composed Components
Translation
Consumer
GDS
Translation
GDT
GDS:performScript
GDT
GDT
Client
GDS:performScript
GDS:performScript
GDS:performScript
Composing Components
OGSA-DAIComponent
OGSA-DAIComponent
OGSA-DAIComponent
Data Transport
Data Transport
Data Transport
Data Transport
DAI Key Components
GridDataService GDS Access to data & DB operations
GridDataServiceFactory GDSF Makes GDS & GDSF
GridDataServiceRegistry GDSR Discovery of GDS(F) & Data
GridDataTranslationService Translates or Transforms Data
GridDataTransportDepot GDTD Data transport with persistence
Relational & XML models supportedRole-based AuthorisationBinary structured files
OGSA Relationship
Class GridService Registry NotificationConsumer NotificationProducer
GDS Mandatory Optional Normal
GDSF Mandatory Optional Normal
GDSR Mandatory Mandatory Normal
GDTS Mandatory
GDTD Mandatory Optional Normal
DAI portType Usage
Class GridDataService DataTransport Factory
GDS Mandatory Normal
GDSF Optional Normal Mandatory
GDSR Optional
GDTS Optional Mandatory
GDTD Optional Mandatory
Distributed Query
Registry R
Client
Consumer GDT
GDS
GDTV
DQP
GDT
GDTV
GDS
QPM
NS
F Factory
Evaluator
GDTV GDT
Evaluator
GDTV GDT
Evaluator
GDTV GDT
GDS
GDS
GDTV DB
T
Q
T
PNM
T
PNM
GDS
T
GDTV
D Q P : D is t ribu te d Q u e ry Pro ce s s o rG D T : G rid D a ta Tra n s po rtT : Tra n s la t io nQ : Q u e ryG D TV : G rid D a ta Tra n s po rt V e h icleF : Fa cto ryQ PM : Q u e ry Pro g re s M o n ito rPNM : Pro g re s s No t if ica t io n M e s s a g eA M : A pplica t io n M e ta da taC R M : C o m pu ta t io n a l R e s o u rce M e ta da taNS : No t if ica t io n S in k
1
2
5
3
4
5
5
7
7
6
6
7
7
7
7
(7) 8
6
OGSA-DAI Time Line
Feb ’02 May ’02 Jul ’02 Sep ’02 Dec ’02 Feb ’03 May ’03 Sep ’03
Ship Alpha Release for GT3 Integration
RDB + GT2 / OGSA Prototypes Available
XML + OGSA Prototype Available
Design Documents & Demos for DAIS WG @ GGF5
XML + OGSA Prototypes for Early Adopters
WS + GSI UK support ( > 100 downloads)
Phase 2 StartsPhase 1 Starts
Presentation & Beta @ GGF7
GGF6 WG Papers & Prototypes
Productisation, RAMPS &Extension
OGSA-DAI Summary
On Schedule & Going WellContributions via DAIS-WG @ GGF5 & 6Releases with GT3 Releases scheduledStatus: Early Days
Released prototypesTested Architectural DesignUsing OGSAWorking with Early Adopter Pilot Projects
AstroGrid & MyGrid
First PRODUCT release Dec ‘02
Influence OGSA-DAI directionVia DAIS-WG & Direct messages to us
Data Processing
Processing Characteristics-Well defined work flow-Correction, calibration, transformation,filtering, merging-Relatively static reference data-Stable processing functions (audited changes)-Periodic reprocessing from archive
Instrument
Raw Data
Reference Data
Multi-stageProcessing
ProcessedData
Archive Archive
In Silico
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Analysis and Interpretation
Summarisation
ProcessedData
Archive
SummarisedData
Analysis Characteristics
- Variable workflow
- Standard functions
- Standard and personal
filtering and summarisation
- Retain drill down capability
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Analysis and Interpretation
Analysis and Interpretation Characteristics- Highly dynamic work flow- Multiple data types- Volatile data- Annotations, inferences, conclusions- Evidential reasoning- Shared multiple versions of truth- Periodic version consolidation
ProcessedData
Result dataRetrieval &Update
SummarisedData
PersonalisedDatabase Conclusions/InferencesConclusions/Inferences
- DescriptionsDescriptions- TrendsTrends- CorrelationsCorrelations- RelationshipsRelationships
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Metadata Requirements
Technical Metadata Direct referencing - Physical location and data schema/structureData currency/status – version, time stampingAccreditation/Access permissions - Ownership (Dublin Core)Query time/Governance - data volume, no. of records, access paths
Contextual MetadataLogical referencing physical data – semantic/syntactic ontologiesLexical translation – Thesaurus, ontological mappingNamed derivations (summarisations)
Scope of RequirementsAll science communitiesRelated to provenance
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Metadata Requirements
Data VersioningDistinguish latest/agreed version of dataMaintain history record of changeSynchronise and mirror replicated dataDistinguish shared personal interpretations and/or annotations
Provenance Record of data processing – calibration, filtering, transformationRecord of workflow – methods, standards and protocolsReasoning – evidential justification for inferences & conclusions
Scope of RequirementsAll science communitiesIncludes Technical and Contextual Metadata
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Provenance Issues
Schema evolutionGranularity of record
Processed v Derived
InheritanceLack of structured annotations, ontologiesInteractive analysis = dynamic workflowMultiple derived data sourcesContext of usageBest practice can changeMultiple versions of the truthEvidential reasoningExisting data & applicationsWhere is the provenance record stored
Dave Pearson Provenance and Derivation workshop 18 Oct 02, Chicago
Collaborative Annotation
See DASDistributed Annotation ServiceChallenges
Autonomy Selective viewing Identification Provenance Derivation
Biomedical e-Scientists
Is this one species?Understanding bird energyUnderstanding a river / ocean interactionUnderstanding a biochemical pathwayUnderstanding a cellUnderstanding a Heart or BrainUnderstanding RhododendraUnderstanding Evolution…
No One-Size fits all solutionsBut sharable re-usable components
Opportunities
Many, many …More than we can addressCompute needsData management needsData integration needs…
Must choose some pioneersTo meet a range of common requirementsTo provoke rich & high-level platformTo generate re-usable components
A Long-Term Commitment Needed
Advancing SDMIV Grid
Grid Plumbing & Security Infrastructure
Scheduling Accounting Authorisation
Monitoring Diagnosis Logging
Scientific Application
Data & Compute Resources
Distributed
SDMIV Users
Data Access
Data Integration
Structured Data
SDMIV (Grid) Application Component Library
Summary
e-ScienceData as well as Compute Challenges
Needed to be put together
Need ubiquitous supported consistent platforms
GridA (potentially) invaluable platformOnly show in town
Data IntegrationHard Develop & Use Standard kit of partsStarted to build the kitNo ready made general integrationCombines application & computing science
OpportunitiesNo one-size fits all, but re-usable subsystemsInvest in wider range of Problem driven pioneeringStrategic choices needed