Big Data & Data Science: A Practitioner’s Perspective
-
Big Data & Data Science: A Practitioner’s Perspective
Arcot [email protected]
The University of North Carolina at Chapel Hill
-
Outline
• Challenges in Big Data & Data Science
– Scientific Data Explosion & Role of Data Science
• Gearing to Meet the Challenges
– Some Projects at UNC, Chapel Hill
• Looking Towards the Future
– Integration of Data, Computing & Networks
-
Big Data Everywhere!
• Lots of data collected and analyzed
– Sensors and Instruments
– Large Scientific projects
– Web data, e-commerce
– Commercial/Financial transactions
– Social Network data
– Medical & Health Information
– Smart Cities
-
Let’s Start with an Analogy
-
Data - Today
-
Data - Tomorrow
-
Characteristics of Big Data
Five Vs
– Volume – Exponential Increase in Size & Count
– Velocity – Speed at which Data is Created, Processed or Used
– Variety – Multi-dimensionality, arrangement, format, …
– Veracity – Integrity & Fidelity
– Value – Worth
Plus
– Findability
– Availability
-
Four Kinds of Big Data (1)
• Archetypal Big Data
– Science Projects – LHC, LSST, SCEC, OOI, …
– Business/Industry – Genomics, Finance, Pharma, …
– Government – NASA, NOAA, NCDC, …
• Volume – High – large datasets
• Velocity – High but predictable
– Few Sources, Multiple Destinations
• Variety – Low – Standardized Formats
– Few Varieties
• Veracity – High Fidelity and Credible
– High Quality Metadata, Corrected data
• Value – High – focused, funded
– Managed by Professionals
• Findability – High – Known distribution sites
– Set Policies for Sharing and Administration
• Availability – High – Published API
– Curated for Long-term Preservation
Light & Visible Data
-
Four Kinds of Big Data (2)
• Crowd-Sourced Big Data
– Social Media – Facebook, Twitter, Instagram, …
– Recommenders – Yelp, Angie’s List, Groupon
– Web Commerce – Amazon, Ebay, Orbitz, eNews
• Volume – High – small data
• Velocity – High and non-predictable
– Multiple Sources/Destinations, Few Concentrations
• Variety – High – But well managed
– Site Specific
• Veracity – Mixed – Low to High
– Crowd Sourced – what do you expect!!
• Value – Ephemeral
– Can be None to High
• Findability – High – Advertised & Known
– Web pages and Apps
• Availability – Immediate Interest
– Long-term Availability is iffy
Nova-like Data
-
Four Kinds of Big Data (3)
• Long-tail Big Data
– Science Projects – small teams and organizations
– Personal – Hobbies, Amateur/Citizen Science/Arts
– Government – Internal and unpublished
• Volume – High – small data sets
– Highly Distributed, Isolated and Hidden
• Velocity – Low
– Internal usage
• Variety – High – Too many, No Common Format
– Project specific, Idiosyncratic
• Veracity – Non Credible until proven
– No Metadata or Non-standard Metadata
– Fidelity and Integrity unknown
• Value – unknown
– Earned by reputation
• Findability – None – Hidden and not advertised
– No Sharing, No Services
– No Administration or Management
• Availability – None – On local disks and tapes
– No Notion of Long-term Preservation
Dark Data
-
Four Kinds of Big Data (4)
• Sensor Streams
– Environmental & Geoscience Sensors
– Smart Cities & Internet of Things – Ubiquity
– Personal – Wearables, Health and Home (Fitbit, iPhones)
• Volume – High – small packets to HD video
– Highly Distributed, Few concentrators
• Velocity – High and time-critical
– Internal usage
• Variety – High – Too many Types
– Sensor/Vendor specific
• Veracity – Credible (until proven)
– Non-standard Metadata
• Value – unknown (as yet)
– New Applications
• Findability – None – Hidden and not advertised
– No Sharing, Closed Services
– No Data Management
• Availability – None After Immediate
– Very Low Long-term Preservation
Dark Data
-
Four Kinds of Big Data
• Archetypal
– Science Projects: LHC, LSST, SCEC
– Business/Industry: Genomics, Finance
– Government: NASA, NOAA, DOE
• Crowd-Sourced
– Social Media: Facebook, Twitter
– Recommenders: Yelp, Angie, Groupon
– Web Commerce: Amazon, Ebay
• Long-tail
– Science Projects: Small organizations
– Personal: Hobbies, Citizen Science/Arts
– Government: Internal and unpublished
• Sensor Streams
– Internet of Things: Appliances, Homes
– Smart Cities: Energy grids, Transportation
– Health: Biosensors, ER, OR

Characterization   Archetypal   Crowd-Sourced   Long-tail   Sensor Streams
Volume             High         High            High        High
Velocity           High         Bursty          Low         High
Variety            Low          High            High        High
Veracity           High         Mixed           Low         Mixed
Value              High         Ephemeral       Unknown     Huge
Findability        High         High            None        None
Availability       High         Short-term      None        Low
-
Big Data Paradigm Shift
We need to know more about Data Science because we are in the midst of a paradigm shift: not only do we have big data, but the way we do Science, Research and Business is changing
• Compute Intensive to Data Intensive
• Large Actions on Small Amounts of Data to Small Actions on Large Amounts of Data
• Move Data to Processing Site (Supercomputer Model, Warehouse Model) to Move Process to Data Site (Map-Reduce Model, Federation Model)
• Function Chaining (Programs) to Service Chaining (Workflows and DataFlows)
Leading to a Large Paradigm Shift:
• Model-based Science/Business (Observe-Hypothesize-Test) to Data-based Science/Business (Data Mining, Knowledge Discovery)
Data Science is needed to bridge the future
-
Data Science (diagram): Computer Science; Information Science; Social Science; Math, Logic, Statistics, OR, …; Domain Sciences (Engineering, Economics, Medical, Legal, …)
-
Outline
• Challenges in Big Data & Data Science
– Scientific Data Explosion & Role of Data Science
• Gearing to Meet the Challenges
– Some Projects at UNC, Chapel Hill
• Looking Towards the Future
– Integration of Data, Computing & Networks
-
Building a Big Data Platform for The First & Third Kind
-
Big Data Problems
• Where is the processing hosted?
– Distributed server/cloud (e.g. Amazon EC2, Microsoft Azure)
• Where is the data stored?
– Distributed Storage (e.g. Amazon S3)
• What is the programming model?
– Distributed processing (e.g. Google’s Map Reduce, Hadoop)
• How is the data stored and indexed? (e.g. Apache Cassandra)
– High performance schema-free database
• What operations are performed on the data?
– Analytic/Semantic Processing (e.g. SAS, KNIME, …)
• How to make data available?
• How to manage the data?
• How to find it?
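The Map-Reduce programming model named above can be illustrated with a minimal single-process sketch. It is a toy word count, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps on the nodes that hold the data.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) pairs for one input record (a line of text)."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(records):
    """Run map, shuffle and reduce in sequence over a list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "data science", "Big Data"]))
```

Each record is processed independently in the map step, which is what allows the work to be moved to where the data lives.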
-
Building Towards Big Data
• Storage Resource Broker (SRB) (1996-2006)
– Massive Data Analysis System (DARPA)
  • Super computing
– Distributed Object Computation Testbed (DARPA, USPTO, NARA)
  • Distributed Data Flow
– National Partnership for Advanced Computing Infrastructure (NSF)
  • Science Data and Metadata Management
– Transcontinental Persistent Archives Prototype (NARA)
  • Infrastructure Independence
• Integrated Rule Oriented Data Systems (iRODS) (2005 onward)
– Life Time Library (SILS)
  • Personal Digital Library
– Carolina Digital Repository (UNC)
  • Institutional Repository
– DataNet Federation Consortium (NSF)
  • National-scale Cross Disciplinary Collaboration
• Data Bridge CI (2013 onward)
– Data Bridge (NSF)
  • Long-tail of Science “Data Communities”
– Data Bridge for Neuroscience (NSF)
– …
-
Big Data of the First Kind: iRODS Distributed Data Management
-
iRODS Shows Unified “Virtual Collection”
The iRODS Data System can install in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.
• User with Client: Views & Manages Data
• User Sees Single “Virtual Collection” with metadata & services
• My Data: Disk, Tape, Database, Streams, Filesystem, etc.
• Data in the Cloud: Disk, Tape, Database, Filesystem, etc.
• Partner’s Data: Remote Disk, Tape, Files, DBs, etc.
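The idea of a single “Virtual Collection” over local, cloud and partner storage can be sketched as a logical namespace that maps to physical replicas. This is a conceptual illustration only, not the iRODS API: the classes `Replica` and `VirtualCollection`, the resource names and the paths are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    resource: str        # hypothetical storage resource, e.g. "local-disk"
    physical_path: str   # where the bytes actually live on that resource

@dataclass
class VirtualCollection:
    """Maps logical paths to physical replicas and attaches metadata."""
    replicas: dict = field(default_factory=dict)   # logical path -> [Replica]
    metadata: dict = field(default_factory=dict)   # logical path -> {key: value}

    def register(self, logical_path, resource, physical_path, **meta):
        """Record one physical copy of a logical object, plus optional metadata."""
        self.replicas.setdefault(logical_path, []).append(
            Replica(resource, physical_path))
        self.metadata.setdefault(logical_path, {}).update(meta)

    def ls(self, prefix="/"):
        """List logical names only; storage details stay hidden."""
        return sorted(p for p in self.replicas if p.startswith(prefix))

    def find(self, key, value):
        """Metadata-based discovery across all underlying resources."""
        return sorted(p for p, m in self.metadata.items() if m.get(key) == value)

vc = VirtualCollection()
vc.register("/proj/run1.dat", "local-disk", "/mnt/d1/0001.bin", domain="hydrology")
vc.register("/proj/run1.dat", "cloud-bucket", "bucket/objs/0001", domain="hydrology")
vc.register("/proj/run2.dat", "partner-tape", "tape7:/x/0002", domain="ocean")
print(vc.ls("/proj"))
print(vc.find("domain", "hydrology"))
```

The user works with `/proj/...` names and metadata, while the mapping to disk, cloud or partner storage is kept in the catalog.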
-
Ingredients of iRODS
• Virtualization
– Independence from Technology - Namespaces
– Third Party Authorization, Authentication, Auditing, Accounting, Mediation
• Automation thru Policies (Rules & Microservices)
– Event-based and periodic actions
– Orchestration of Workflows and Provenance Capture
– Repeatable Data Science
• Metadata
– Integration of Context
– Mediation across Multiple Metadata Systems
• Federation
– Three types of Federation
– Link systems/resources together
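Automation through policies can be pictured as rules that fire on events and chain small operations, with each step recorded for provenance. The sketch below is an assumption-laden toy, not the iRODS rule language: `PolicyEngine`, `compute_checksum`, `replicate` and the `"put"` event name are hypothetical.

```python
import hashlib
from collections import defaultdict

class PolicyEngine:
    """Binds event names to chains of micro-service-like functions."""

    def __init__(self):
        self.rules = defaultdict(list)   # event name -> list of chains
        self.provenance = []             # audit trail of executed steps

    def on(self, event, *microservices):
        """Register a chain of operations to run when `event` occurs."""
        self.rules[event].append(microservices)

    def fire(self, event, obj):
        """Run every chain registered for `event`, logging each step."""
        for chain in self.rules[event]:
            for service in chain:
                service(obj)
                self.provenance.append((event, service.__name__, obj["name"]))

def compute_checksum(obj):
    """Store a SHA-256 digest of the object's bytes as persistent state."""
    obj["checksum"] = hashlib.sha256(obj["data"]).hexdigest()

def replicate(obj):
    """Pretend to create one more replica by bumping a counter."""
    obj["replicas"] = obj.get("replicas", 1) + 1

engine = PolicyEngine()
engine.on("put", compute_checksum, replicate)

obj = {"name": "run1.dat", "data": b"abc", "replicas": 1}
engine.fire("put", obj)
print(engine.provenance)
```

Because the rule, not the user, decides what happens on ingest, the same steps run every time, which is the basis for repeatable data management.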
-
Policy-based Collection Management (concept diagram)
Concepts: Collection, Purpose, Completeness, Correctness, Consensus, Consistency, Attribute, Property, Policy, Procedure, Policy Enforcement Point, Client Action, Periodic Assessment, Criteria, Workflow, Micro-service, Operation, Persistent State Information, Digital Object
Relations: Has, HasFeature, HasSubType, Isa, Defines, Controls, Updates, Invokes, Chains
Examples shown: Integrity, Authenticity, Access control, Replication, Checksum, Quota, Data Type
Micro-services shown: msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj
Persistent state attributes shown: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM
-
Datanet Federation Consortium
http://datafed.org/
VISION:
• Enable Collaboration across Scientific Domains
• Support reproducible data-driven research
• Build a National Scale Data Cyberinfrastructure
IDEAS: Build a Federated Environment
• Data Sharing
• Metadata-based Discovery
• Workflow Orchestration
• Policy as basis for collaboration
-
DataNet Federation Consortium Vision
• Enable collaborative research
– Sharing of data, information, and knowledge
• Build national-scale cyber-infrastructure
– Federation of existing data management systems
• Support reproducible data-driven research
– Encapsulate knowledge in shared workflows
• Enable student participation in research
– Policy-controlled access to “live” data
Layers (diagram):
– Research Environment - Portals, Applications, Workflows
– DFC Collaboration Environment - Data Grid
– Community Resource - Repository, Catalog
-
Data Driven Science and Engineering
• Collaboration Environments
– Oceanography – Ocean Observatories Initiative
  • Archiving of climatic data records from real-time sensor data streams, replay of sensor data
– Engineering – CIBER-U
  • Engineering Digital Library: curation of civil engineering data, student training materials
– Hydrology – CUAHSI, …
  • Automation of hydrology research workflows (reproduce, reuse and repurpose)
– Plant Biology – iPlant Collaborative
  • Project data sharing and integration, virtualized metadata services
– Social Science – Odum Institute
  • Survey data and Statistical data processing
– Cognitive Science – Temporal Dynamics Learning
  • Inter-team collaboration policies, human data
Engineering Representation
-
Federation is Central to DFC
• DFC exposes three models of Federation
– Strong Federation
  • Full and complete protocol-level federation across grids
  • Seamlessly Move from one grid to another
  • Used in DFC to federate Science & Engineering grids
– Weak Federation = peering
  • One-way DFC to External Micro-services and Workflows
  • DFC needs to ‘know’ the external protocol - plug-ins & wrappers
  • Used in DFC
    – To access THREDDS (netCDF), Sensor system, federal data resources
    – SEAD, DataONE, CUAHSI-HIS, and AWS
– Asynchronous Federation = loose coupling
  • Message used to interact with external system
  • Easy to connect (similar to Data Bridge)
  • Used in DFC to expose indexing and format conversion services
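Asynchronous federation, where a message is the only contact point between two systems, can be sketched with an in-process queue standing in for a message bus. This is an illustrative assumption, not DFC code: `publish`, `run_indexer`, the JSON message shape and the paths are hypothetical.

```python
import json
import queue

def publish(bus, action, logical_path):
    """Data grid side: announce work as a message, then move on."""
    bus.put(json.dumps({"action": action, "path": logical_path}))

def run_indexer(bus, index):
    """External service side: drain the bus and handle only 'index' requests."""
    while True:
        try:
            raw = bus.get_nowait()
        except queue.Empty:
            break
        message = json.loads(raw)
        if message["action"] == "index":
            index.add(message["path"])

bus = queue.Queue()
index = set()
publish(bus, "index", "/dfc/hydro/gauge.nc")
publish(bus, "convert", "/dfc/hydro/notes.doc")   # ignored by the indexer
run_indexer(bus, index)
print(sorted(index))
```

Neither side calls the other directly or needs to know its protocol, which is why this style is easy to connect compared with strong or weak federation.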
-
Three Federations in DFC (diagram)
Labels shown: Eng, Mar, Main, Hydr, SoL, Soc, Bio; THREDDS, SEAD, TerraPop, NASA/NCCS, NCDC, EC2, S3, ARTS/ORB; DataBook, Vivo Indexer, HIVE Ontology 1, HIVE Ontology 2, Elastic Search, Format Indexer, SOLR, JENA; Data Verse, DataONE
-
Policies Govern the DFC
Policies for automating data management
Areas (diagram): Standards Groups; International Projects; Advisory Committee; Science & Engineering Domains; Sustainability and Institutions; Facilities and Operations; Technology and Research; Education and Outreach; Policies and Standards
• Policies for publication & federation
• Policies for IPR & citations
• Policies for provenance & sustainability
• Policies for collaboration and reuse
• Policies for technology migration
• Policies for metadata extraction
• Policies for analysis and workflow
• Policies for change management
• Domain-centric policies
• Policies for authentication & authorization
• Policies for archiving, staging & caching
• Policies for replication & synchronization
• Policies for retention & disposition
• Policies for deletion & redaction
• Policies for trust
• Policies for curation & preservation
-
How iPlant CI Enables Discovery
Overview of resources
Labels shown (diagram): End Users, Computational Users; XSEDE; Storage, Computation, Hosting, Web Services, Scalability
Building a platform that can support diverse and constantly evolving needs.
-
Fig 3. iPlant Discovery Environment, showing the Data, App, and Analyses windows.
Merchant N, Lyons E, Goff S, Vaughn M, Ware D, et al. (2016) The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences. PLOS Biology 14(1): e1002342. https://doi.org/10.1371/journal.pbio.1002342
-
How iPlant CI Enables Discovery
What iPlant data solutions mean for a bovine breeder
“It's kind of like being in that COPD commercial where the weight is lifted off your chest, only in our case, we have access to more computational power, so we can get to projects much faster and we can do big projects that our machines may not have allowed us to do previously!
The ability to transport 2TB of data overnight using the iRODS system was particularly helpful because previously, we had been mailing hard drives which is not an optimal solution to sharing big data.”
James Koltes, Iowa State
-
Big Data of the Third Kind: DataBridge
http://databridge.web.unc.edu/
VISION: Data as a Citizen – Empower Data
IDEAS:
• Build Social Networks for Long-tail Science Data
• Detect Communities through Multi-dimensional Socio-metric Analyses
-
Power law distribution
Data in the Large
Data in the Small
Long Tail of Data
• Dark Data - Hidden, Unknown
• But has Presence and Value/worth
• Voluminous, Variegated, Unknown Veracity
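A rough way to see why the long tail is voluminous: under an assumed power law, where the dataset at rank r has size proportional to r to the power of minus the exponent, the many small datasets can together outweigh the few large ones. The function `tail_share`, the exponent of 1.0 and the counts below are illustrative assumptions, not measurements.

```python
def tail_share(n_datasets, head_size, exponent=1.0):
    """Fraction of total volume held outside the `head_size` largest datasets,
    assuming the dataset at rank r has size proportional to r ** -exponent."""
    sizes = [rank ** -exponent for rank in range(1, n_datasets + 1)]
    return sum(sizes[head_size:]) / sum(sizes)

if __name__ == "__main__":
    # Share of volume outside the 10 largest of 1000 hypothetical datasets.
    print(round(tail_share(1000, 10), 3))
```

With these assumed numbers, the 990 datasets in the tail hold roughly three fifths of the total volume.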
-
Problems: Long-tail of Science Data
First Mile Problem
• How to make it available?
• Where do I upload?
• Who is in charge?
• How do I get credit?
• Can I control access?
• How do I pool with other like-minded researchers? Community services?
• How much is long-term?
• Who pays for it?
Last Mile Problem
• How to make it findable?
• What is needed to make it more visible? Metadata?
• Are there other methods to make my data findable?
• My data has specific ways & characteristics; how do I expose them as finding aids?
• How can I find similar data?
Solving the long-tail problem will also help the other two Big Data problems
-
Data Bridge: Design
Construct multi-dimensional social networks for data. Three challenges:
• Evaluate multiple types of “metrics” on data
– Domain-specific, genre-specific, project-specific
– Use Socio-metric Network Algorithms
– Similar to – but for data
• Find relevance
– Slices of similarity
– Explore Relationships between Data, Users, Resources, Methods, Workflows, …
– Use Relevance Algorithms
• Create communities
– Use Clustering Algorithms
• Provide an extensible & big data framework
– Democratize the process
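The design steps above (score similarity between datasets, keep the relevant links, group datasets into communities) can be sketched in a few lines. The sketch swaps in deliberately simple stand-ins: Jaccard similarity over keyword sets as the “metric” and connected components as the clustering step. These are assumptions for illustration, not the algorithms Data Bridge uses, and the dataset names and keywords are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two keyword sets: size of overlap over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_network(datasets, threshold=0.5):
    """Return {(name1, name2): score} for pairs at or above the threshold."""
    edges = {}
    for n1, n2 in combinations(sorted(datasets), 2):
        score = jaccard(datasets[n1], datasets[n2])
        if score >= threshold:
            edges[(n1, n2)] = score
    return edges

def communities(datasets, edges):
    """Group datasets into connected components using union-find."""
    parent = {name: name for name in datasets}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for name in datasets:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())

datasets = {
    "soil_survey_a": {"soil", "nc", "ph"},
    "soil_survey_b": {"soil", "nc", "moisture"},
    "fmri_study": {"fmri", "brain"},
}
edges = similarity_network(datasets, threshold=0.5)
print(edges)
print(communities(datasets, edges))
```

Raising or lowering `threshold` plays the role of filtering connectivity by similarity value: a higher value keeps fewer links and yields smaller communities.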
-
ScreenShot: Finding similarities
Filter Connectivity by similarity value
Select Network Data
-
ScreenShot: Weight of similarity
Similarity measure: 0.5
-
ScreenShot: Highlights of similarities
Link to the data
-
Data Bridge Architecture
-
Outline
• Challenges in Big Data & Data Science
– Scientific Data Explosion & Role of Data Science
• Gearing to Meet the Challenges
– Some Projects at UNC, Chapel Hill
• Looking Towards the Future
– Integration of Data, Computing & Networks
-
Two New (inter-related) Challenging Problems:
Smart Cities
Internet of Things
Big Data of the Fourth Kind
-
What is a smart city?
1. SMART: Urban platforms, Internet, (Big) data manipulation, sensing and metering
2. SUSTAINABLE: Healthy, Pollution free, Balanced
3. INCLUSIVE: Everyone has equal rights and possibilities, no poverty, socially and economically sustainable as well
4. INNOVATIVE: Technologies, creativity, buildings, transport, etc.
5. WELL PLANNED: Urban planning 2.0
SMART CITY
Smart City = Smart Information + Smart Infrastructure + Smart Communication
-
Internet of Things
• The Internet of Things: “Internet of Objects”, “Machine-to-Machine Era”, “Internet of Everything”
• “The Internet of Things, also called The Internet of Objects, refers to a wireless network between objects, usually the network will be wireless and self-configuring, such as household appliances.” -- Wikipedia
• “Internet of Things refers to the concept that the Internet is no longer just a global network for people to communicate with one another using computers, but it is also a platform for devices to communicate electronically with the world around them.” -- Center for Data and Innovation
-
Our Project: Smart State NC
• Promote Sustainable Social and Economic Developments across the State of North Carolina
• Build bridges across Rural and Urban Communities
• Apply Innovative Technologies
– Big Data & Deep Learning
– Data Sharing Urban Platforms
– Social Media Networks
– Internet of Things & Sensor Networks
– Cloud Computing & Edge Computing
• North Carolina as an Extreme Living Lab
– Pilot Deployments across Communities
-
NSF SCC Proposal
• Building a North Carolina Smart & Connected Communities Hub
– Tackle the growing gap between urban and rural communities.
– Adapt and adopt S&CC innovations in data analytics and technology from urban settings and effectively apply them in rural settings.
– Use the State of North Carolina as a living laboratory and present a series of well-defined research experiments targeted at specific NC communities.
• The goal of each living lab is to apply innovative strategies to solve rural problems and provide a sustainable resource for use by rural communities well beyond the endpoint of the proposed research.
– Through the use of smart technological and social innovations, we hope to (re)connect small rural communities with each other and with urban areas across NC.
• Multiple government agencies and five universities
• https://smartcities.web.unc.edu/
-
S&CC Model
S&CC Model of Connectedness: Multiple Dimensionalities (triangles), Each with Shared Disciplinary Components (circles)
Labels shown (diagram):
– Smart Analytics; Hardware Platform; Software Tools; Data Networks
– Smart Infrastructure; Smart Air & Water; Smart Energy; Smart Environment
– Rural; Urban; National
– Smart Communities; Smart Living; Smart Citizens; Smart Health
– Smart Mobility; Smart Economy; Smart Governance; Smart Policies; Smart Literacy
– Data & Information Science; Computer Science; Social Science
– ICT: Information & Communication Technologies
– IoT: Internet of Things (sensors)
– Advanced Data & Information Analytics
– Social Media & Networks
-
NC Smart & Connected Communities Hub: A Socio-Technical System Approach
• Living Labs: Health, Food, Air & Water, Disaster Recovery, Energy, Pollution, Mobility
• NCAoT: North Carolina Array of Things Sensor Network
• PSMS: Place-based Social Media Services
• SCCConnect: Open Data Platform for Collaboration, Data & Apps
• Mitigate Literacy Gap (Accessibility & Literacy POPs)
• Community Engagement (Workshops & Meetings)
• Cross-cutting Themes: Enhanced Accessibility, Rural Revitalization & Sustainable Solutions
-
Players in NSF SCCI (proposed project)
Community Engagement (Workshops & Town Hall Meetings): Wake Forest [?], Sylva [?], Transylvania County [?], Kinston [?], Jordan Lake, Lake Howell, Pitt County [?], Spindale [?], Salisbury [?], Region A (SWC) [?], Region G (PTRC) [?], Region C (IPDC) [?]

Area                                 Communities   Academic Engagement*
Living Lab Areas for Rural Revitalization & Sustainable Solutions
Smart Environments
Disaster Recovery                    X             ECU, UNC-CH
Energy                               X X           WCU, UNC-C
Food Production                      X             NCSU
Water Management                     X X           UNC-C, UNC-CH
Pollution Monitoring                 X X X         UNC-C, UNC-CH
Common NCSCCHub Infrastructure
Smart Policies
Open Data Platforms (SCCConnect)     X X X X       UNC-CH
Sensor Networks (NCAoT)              X X X X       UNC-C, UNC-CH
Smart Citizens
Accessibility/Literacy PoPs          X X           UNC-CH
Place-based Social Media Systems     X X X X       UNC-C
*ECU = East Carolina University; NCSU = North Carolina State University; UNC-C = UNC – Charlotte; UNC-CH = UNC – Chapel Hill; WCU = Western Carolina University; SWC = South Western Commission; IPDC = Isothermal Planning and Development Commission; PTRC = Piedmont Triad Regional Council
-
S&CC Platform Framework
• Central Hub
• Science Spokes: Food, Air & Water, Energy, Disaster Recovery, Health
• Region Spokes: Appalachia, Piedmont, Coastal Plains, Coast, Western Foothills
• Infrastructure Spokes: Sensor Apps & Data Analytics, Open Data Platform, Social Media Systems, Community PoPs
• Community Spokes: Governments, Citizens, Academics, Industries & Businesses, NGOs
Fig 2. NC SCC Hub and Spoke System. The system is multi-tiered with four key hub-and-spokes dimensions. The central hub stitches together all the dimensions and acts as a coordination center for the whole initiative.
-
A Second Challenge: Data-Centric Collaborative Research
Confluence of forces reshaping how data is leveraged
• Data volumes, network limitations, requirements around security, privacy, regulations, new analysis methods that rely on unique hardware/software systems, cloud computing models, dizzying array of new tools
Forces are misdirecting researchers from science
• Provisioning storage, compute servers, and networking
• Setting up and configuring data sharing and transfer tools
• Provisioning and customizing complex tools on complex IT and High Performance Computing (HPC) systems
• Determining how to secure data and IT systems
• Figuring out how to publish data and tools for others to discover and use
Data centric collaborations are requiring too many technologists
• data scientists, informaticians, library scientists, computer scientists, HPC experts, IT vendors, and security/privacy experts
Same challenges exist across scientific domains
-
Approach
Provides an on-demand distributed, private workspace for sharing tools and data
• Deep Indexing (DataBridge)
• Cloud/HPC Computing (CyVerse Atmosphere)
• Elastic Collaboration (RADII)
• Federated Data (iRODS, DFC)
• Secure Data Spaces
• Visualization Interfaces
-
Conclusion
• Challenges in Big Data & Data Science
– Scientific Data Explosion & Role of Data Science
• Gearing to Meet the Challenges
– Some Projects at UNC, Chapel Hill
• Looking Towards the Future
– Integration of Data, Computing & Networks
-
Data Science & Big Data in the Future
• Organization
• Classification
• Ontologies
• Metadata
• Retrieval
• Management
• Collection building
• Analysis
• Information seeking
• Knowledge Representation
• Human Computer Interaction
• Social Skills
• Data Citizenship
• Information Behavior
• Ethics
• Privacy
• Security
• Information Technology
• Transformation, Interpretation
• Dissemination
• Application
• Reference Collections
• Information Processing
• Data Mining
• Information Visualization
• Information Network
• Policy
The List is the Same for Tomorrow – No Different than Today – But Changes to meet the 5Vs
New Methodology, New Way of Thinking, New Processing Paradigms, New Interactions