Big Data & Data Science: A Practitioner’s Perspective

Arcot Rajasekar, The University of North Carolina at Chapel Hill

  • Big Data & Data Science: A Practitioner’s Perspective

    Arcot Rajasekar

    The University of North Carolina at Chapel Hill

  • Outline

    • Challenges in Big Data & Data Science
      – Scientific Data Explosion & Role of Data Science
    • Gearing to Meet the Challenges
      – Some Projects at UNC, Chapel Hill
    • Looking Towards the Future
      – Integration of Data, Computing & Networks

  • Big Data Everywhere!

    • Lots of data collected and analyzed
      – Sensors and Instruments
      – Large Scientific projects
      – Web data, e-commerce
      – Commercial/Financial transactions
      – Social Network data
      – Medical & Health Information
      – Smart Cities

  • Let’s Start with an Analogy

  • Data - Today

  • Data - Tomorrow

  • Characteristics of Big Data

    Five Vs:
      – Volume – Exponential Increase in Size & Count
      – Velocity – Speed at which Data is Created, Processed or Used
      – Variety – Multi-dimensionality, arrangement, format, …
      – Veracity – Integrity & Fidelity
      – Value – Worth
    Also:
      – Findability
      – Availability

  • Four Kinds of Big Data (1)

    • Archetypal Big Data
      – Science Projects – LHC, LSST, SCEC, OOI, …
      – Business/Industry – Genomics, Finance, Pharma, …
      – Government – NASA, NOAA, NCDC, …

    • Volume – High – large datasets
    • Velocity – High but predictable
      – Few Sources, Multiple Destinations
    • Variety – Low – Standardized Formats
      – Few Varieties
    • Veracity – High Fidelity and Credible
      – High Quality Metadata, Corrected data
    • Value – High – focused, funded
      – Managed by Professionals
    • Findability – High – Known distribution sites
      – Set Policies for Sharing and Administration
    • Availability – High – Published API
      – Curated for Long-term Preservation

    Light & Visible Data


  • Four Kinds of Big Data (2)

    • Crowd-Sourced Big Data
      – Social Media – Facebook, Twitter, Instagram, …
      – Recommenders – Yelp, Angie’s List, Groupon
      – Web Commerce – Amazon, Ebay, Orbitz, eNews

    • Volume – High – small data
    • Velocity – High and non-predictable
      – Multiple Sources/Destinations, Few Concentrations
    • Variety – High – But well managed
      – Site Specific
    • Veracity – Mixed – Low to High
      – Crowd Sourced – what do you expect!!
    • Value – Ephemeral
      – Can be None to High
    • Findability – High – Advertised & Known
      – Web pages and Apps
    • Availability – Immediate Interest
      – Long-term Availability is iffy

    Nova-like Data


  • Four Kinds of Big Data (3)

    • Long-tail Big Data
      – Science Projects – small teams and organizations
      – Personal – Hobbies, Amateur/Citizen Science/Arts
      – Government – Internal and unpublished

    • Volume – High – small data sets
      – Highly Distributed, Isolated and Hidden
    • Velocity – Low
      – Internal usage
    • Variety – High – Too many, No Common Format
      – Project specific, Idiosyncratic
    • Veracity – Non Credible until proven
      – No Metadata or Non-standard Metadata
      – Fidelity and Integrity unknown
    • Value – unknown
      – Earned by reputation
    • Findability – None – Hidden and not advertised
      – No Sharing, No Services
      – No Administration or Management
    • Availability – None – On local disks and tapes
      – No Notion of Long-term Preservation

    Dark Data


  • Four Kinds of Big Data (4)

    • Sensor Streams
      – Environmental & Geoscience Sensors
      – Smart Cities & Internet of Things – Ubiquity
      – Personal – Wearables, Health and Home (Fitbit, iPhones)

    • Volume – High – small packets to HD video
      – Highly Distributed, Few concentrators
    • Velocity – High and time-critical
      – Internal usage
    • Variety – High – Too many Types
      – Sensor/Vendor specific
    • Veracity – Credible (until proven)
      – Non-standard Metadata
    • Value – unknown (as yet)
      – New Applications
    • Findability – None – Hidden and not advertised
      – No Sharing, Closed Services
      – No Data Management
    • Availability – None After Immediate
      – Very Low Long-term Preservation

    Dark Data

  • Four Kinds of Big Data

    • Archetypal
      – Science Projects: LHC, LSST, SCEC
      – Business/Industry: Genomics, Finance
      – Government: NASA, NOAA, DOE
    • Crowd-Sourced
      – Social Media: Facebook, Twitter
      – Recommenders: Yelp, Angie, Groupon
      – Web Commerce: Amazon, Ebay
    • Long-tail
      – Science Projects: Small organizations
      – Personal: Hobbies, Citizen Science/Arts
      – Government: Internal and unpublished
    • Sensor Streams
      – Internet of Things: Appliances, Homes
      – Smart Cities: Energy grids, Transportation
      – Health: Biosensors, ER, OR

    Characterization  Archetypal  Crowd-Sourced  Long-tail  Sensor Streams
    Volume            High        High           High       High
    Velocity          High        Bursty         Low        High
    Variety           Low         High           High       High
    Veracity          High        Mixed          Low        Mixed
    Value             High        Ephemeral      Unknown    Huge
    Findability       High        High           None       None
    Availability      High        Short-term     None       Low
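
    The characterization table can be read as a small lookup structure. The sketch below is only an illustration: the dictionary name and the `kinds_where` helper are hypothetical, and the values are copied from the table on this slide.

```python
# Hypothetical encoding of the slide's characterization table.
# Keys are the four kinds of big data; values are the seven characteristics.
CHARACTERIZATION = {
    "Archetypal": {
        "Volume": "High", "Velocity": "High", "Variety": "Low",
        "Veracity": "High", "Value": "High",
        "Findability": "High", "Availability": "High",
    },
    "Crowd-Sourced": {
        "Volume": "High", "Velocity": "Bursty", "Variety": "High",
        "Veracity": "Mixed", "Value": "Ephemeral",
        "Findability": "High", "Availability": "Short-term",
    },
    "Long-tail": {
        "Volume": "High", "Velocity": "Low", "Variety": "High",
        "Veracity": "Low", "Value": "Unknown",
        "Findability": "None", "Availability": "None",
    },
    "Sensor Streams": {
        "Volume": "High", "Velocity": "High", "Variety": "High",
        "Veracity": "Mixed", "Value": "Huge",
        "Findability": "None", "Availability": "Low",
    },
}


def kinds_where(dimension, level):
    """Return the kinds of big data whose `dimension` has the given `level`."""
    return [kind for kind, profile in CHARACTERIZATION.items()
            if profile[dimension] == level]


if __name__ == "__main__":
    # The kinds with no findability are the ones the slides label "Dark Data".
    print(kinds_where("Findability", "None"))
```

    Querying on Findability picks out the two kinds the earlier slides call Dark Data, which is the gap the later projects in the talk aim at.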

  • Big Data Paradigm Shift

    We need to know more about Data Science because we are in the midst of a paradigm shift: not only do we have big data, but the way we do Science, Research and Business is changing.

    • Compute Intensive to Data Intensive
    • Large Actions on Small Amounts of Data to Small Actions on Large Numbers of Data
    • Move Data to Processing Site (Supercomputer Model, Warehouse Model)
      to Move Process to Data Site (Map-Reduce Model, Federation Model)
    • Function Chaining (Programs) to Service Chaining (Workflows and DataFlows)

    Leading to a Large Paradigm Shift:
    • Model-based Science/Business (Observe-Hypothesize-Test)
      to Data-based Science/Business (Data Mining, Knowledge Discovery)

    Data Science is needed to bridge to the future
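
    The "move process to data site" idea can be sketched in a few lines. This is a toy, single-process illustration of the Map-Reduce pattern, not Hadoop: the site names, records, and function names are all made up for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical data sites, each holding its own records locally.
SITES = {
    "site_a": ["temp", "salinity", "temp"],
    "site_b": ["temp", "depth"],
    "site_c": ["salinity", "depth", "depth"],
}


def map_at_site(records):
    """Runs where the data lives: a small action that yields a compact summary."""
    return Counter(records)


def reduce_summaries(left, right):
    """Merges two partial summaries; only summaries travel, not raw records."""
    return left + right


def count_everywhere(sites):
    """Apply the map step at every site, then fold the partial results together."""
    partials = [map_at_site(records) for records in sites.values()]
    return reduce(reduce_summaries, partials, Counter())


if __name__ == "__main__":
    print(count_everywhere(SITES))
```

    The design point is that `map_at_site` is a small action repeated over many pieces of data, and only its output needs to leave the site.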

  • Data Science

    [Diagram; contributing fields:]
    • Computer Science
    • Information Science
    • Social Science
    • Math, Logic, Statistics, OR, …
    • Domain Sciences: Engineering, Economics, Medical, Legal

  • Outline

    • Challenges in Big Data & Data Science
      – Scientific Data Explosion & Role of Data Science
    • Gearing to Meet the Challenges
      – Some Projects at UNC, Chapel Hill
    • Looking Towards the Future
      – Integration of Data, Computing & Networks

  • Building a Big Data Platform for the First & Third Kind

  • Big Data Problems

    • Where is the processing hosted?
      – Distributed server/cloud (e.g. Amazon EC2, Microsoft Azure)
    • Where is the data stored?
      – Distributed Storage (e.g. Amazon S3)
    • What is the programming model?
      – Distributed processing (e.g. Google’s Map Reduce, Hadoop)
    • How is the data stored and indexed? (e.g. Apache Cassandra)
      – High performance schema-free database
    • What operations are performed on the data?
      – Analytic/Semantic Processing (e.g. SAS, KNIME, …)

    • How to make data available?
    • How to manage the data?
    • How to find it?

  • Building Towards Big Data

    • Storage Resource Broker (SRB) (1996-2006)
      – Massive Data Analysis System (DARPA)
        • Super computing
      – Distributed Object Computation Testbed (DARPA, USPTO, NARA)
        • Distributed Data Flow
      – National Partnership for Advanced Computational Infrastructure (NSF)
        • Science Data and Metadata Management
      – Transcontinental Persistent Archives Prototype (NARA)
        • Infrastructure Independence
    • Integrated Rule-Oriented Data System (iRODS) (2005 onward)
      – Life Time Library (SILS)
        • Personal Digital Library
      – Carolina Digital Repository (UNC)
        • Institutional Repository
      – DataNet Federation Consortium (NSF)
        • National-scale Cross Disciplinary Collaboration
    • Data Bridge CI (2013 onward)
      – Data Bridge (NSF)
        • Long-tail of Science “Data Communities”
      – Data Bridge for Neuroscience (NSF)
      – …

  • Big Data of the First Kind: iRODS Distributed Data Management

  • iRODS Shows Unified “Virtual Collection”

    The iRODS Data System can install in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.

    • User with Client: Views & Manages Data
    • My Data: Disk, Tape, Database, Streams, Filesystem, etc.
    • Data in the Cloud: Disk, Tape, Database, Filesystem, etc.
    • Partner’s Data: Remote Disk, Tape, Files, DBs, etc.

    User Sees Single “Virtual Collection” with metadata & services
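
    One way to picture the "virtual collection" is a catalog that maps logical names to physical replicas held on different resources. The sketch below is a conceptual toy under that assumption; the class names, methods, and paths are hypothetical and are not the iRODS API.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    resource: str        # where the bytes live, e.g. "local-disk" or "cloud"
    physical_path: str   # location on that resource


class VirtualCollection:
    """Toy catalog: one logical namespace over many storage resources."""

    def __init__(self):
        # logical path -> {"replicas": [Replica, ...], "metadata": {...}}
        self._catalog = {}

    def register(self, logical_path, replica, **metadata):
        entry = self._catalog.setdefault(
            logical_path, {"replicas": [], "metadata": {}})
        entry["replicas"].append(replica)
        entry["metadata"].update(metadata)

    def ls(self, prefix="/"):
        """List logical paths; the user never sees physical locations here."""
        return sorted(p for p in self._catalog if p.startswith(prefix))

    def locate(self, logical_path):
        """Resources holding a copy of the object, in registration order."""
        return [r.resource for r in self._catalog[logical_path]["replicas"]]

    def find(self, **criteria):
        """Metadata-based discovery across all resources."""
        return sorted(
            path for path, entry in self._catalog.items()
            if all(entry["metadata"].get(k) == v for k, v in criteria.items()))


if __name__ == "__main__":
    vc = VirtualCollection()
    vc.register("/zone/home/me/run1.nc",
                Replica("local-disk", "/data/run1.nc"), project="ocean")
    vc.register("/zone/home/me/run1.nc", Replica("cloud", "bucket/run1.nc"))
    print(vc.ls("/zone/home"), vc.locate("/zone/home/me/run1.nc"))
```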

  • Ingredients of iRODS

    • Virtualization
      – Independence from Technology – Namespaces
      – Third Party Authorization, Authentication, Auditing, Accounting, Mediation
    • Automation through Policies (Rules & Microservices)
      – Event-based and periodic actions
      – Orchestration of Workflows and Provenance Capture
      – Repeatable Data Science
    • Metadata
      – Integration of Context
      – Mediation across Multiple Metadata Systems
    • Federation
      – Three types of Federation
      – Link systems/resources together
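
    Policy-based automation can be pictured as events that trigger ordered chains of small operations, with a log kept for provenance. The sketch below assumes that reading; it is plain Python, not the iRODS rule language, and the event name, micro-service functions, and object fields are hypothetical.

```python
import hashlib


# Hypothetical micro-services: small functions that act on one data object
# and append what they did to a provenance log.
def checksum(obj, log):
    obj["checksum"] = hashlib.sha256(obj["content"]).hexdigest()
    log.append(("checksum", obj["name"]))


def replicate(obj, log):
    obj["replicas"] = obj.get("replicas", 1) + 1
    log.append(("replicate", obj["name"]))


# A policy maps an event to an ordered chain of micro-services.
POLICY = {"on_ingest": [checksum, replicate]}


def enforce(event, obj, log):
    """Policy enforcement point: run the chain registered for `event`, if any."""
    for microservice in POLICY.get(event, []):
        microservice(obj, log)
    return obj


if __name__ == "__main__":
    provenance = []
    item = enforce("on_ingest", {"name": "a.dat", "content": b"abc"}, provenance)
    print(item["replicas"], provenance)
```

    Because the chain and the log are data rather than ad hoc scripts, the same policy can be replayed on new objects, which is one way to read "repeatable data science" on this slide.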

  • Policy-based Collection Management

    [Concept diagram; only the labels are recoverable.]
    Concepts: Collection, Purpose, Attribute, Property, Policy, Procedure, Client Action, Periodic Assessment, Criteria, Policy Enforcement Point, Workflow, Micro-service, Operation, Persistent State Information, Digital Object
    Properties: Completeness, Correctness, Consensus, Consistency, Integrity, Authenticity, Access control, Replication, Checksum, Quota, Data Type
    Relations: Has, HasFeature, HasSubType, Isa, Defines, Controls, Updates, Invokes, Chains
    Micro-services: msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj
    Persistent state attributes: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM

  • DataNet Federation Consortium

    http://datafed.org/

    VISION:
    • Enable Collaboration across Scientific Domains
    • Support reproducible data-driven research
    • Build a National Scale Data Cyberinfrastructure

    IDEAS:
    • Build a Federated Environment
    • Data Sharing
    • Metadata-based Discovery
    • Workflow Orchestration
    • Policy as basis for collaboration

  • DataNet Federation Consortium Vision

    [Layer diagram:]
    • Research Environment – Portals, Applications, Workflows
    • DFC Collaboration Environment – Data Grid
    • Community Resource – Repository, Catalog

    • Enable collaborative research
      – Sharing of data, information, and knowledge
    • Build national-scale cyber-infrastructure
      – Federation of existing data management systems
    • Support reproducible data-driven research
      – Encapsulate knowledge in shared workflows
    • Enable student participation in research
      – Policy-controlled access to “live” data

  • Data Driven Science and Engineering

    • Collaboration Environments
      – Oceanography – Ocean Observatories Initiative
        • Archiving of climatic data records from real-time sensor data streams, replay of sensor data
      – Engineering – CIBER-U
        • Engineering Digital Library: curation of civil engineering data, student training materials
      – Hydrology – CUAHSI, …
        • Automation of hydrology research workflows (reproduce, reuse and repurpose)
      – Plant Biology – iPlant Collaborative
        • Project data sharing and integration, virtualized metadata services
      – Social Science – Odum Institute
        • Survey data and Statistical data processing
      – Cognitive Science – Temporal Dynamics Learning
        • Inter-team collaboration policies, human data

    Engineering Representation

  • Federation is Central to DFC

    • DFC exposes three models of Federation
      – Strong Federation
        • Full and complete protocol-level federation across grids
        • Seamlessly Move from one grid to another
        • Used in DFC to federate Science & Engineering grids
      – Weak Federation = peering
        • One-way DFC to External Micro-services and Workflows
        • DFC needs to ‘know’ the external protocol – plug-ins & wrappers
        • Used in DFC
          – To access THREDDS (netCDF), Sensor system, federal data resources
          – SEAD, DataONE, and CUAHSI-HIS, AWS
      – Asynchronous Federation = loose coupling
        • Message used to interact with external system
        • Easy to connect (similar to Data Bridge)
        • Used in DFC to expose indexing and format conversion services
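
    The loose coupling of the asynchronous model can be sketched with an in-process queue standing in for a message bus. This is an illustration only: the message fields, event names, and the indexer function are hypothetical, and a real deployment would use an external message broker rather than `queue.Queue`.

```python
import json
import queue


def publish(bus, event, logical_path):
    """Data grid side: post a message and move on, without waiting for a reply."""
    bus.put(json.dumps({"event": event, "path": logical_path}))


def run_indexer(bus, index):
    """External service side: drain pending messages at its own pace.

    Returns the number of messages handled in this pass.
    """
    handled = 0
    while True:
        try:
            message = json.loads(bus.get_nowait())
        except queue.Empty:
            return handled
        if message["event"] == "ingested":
            index.add(message["path"])
        handled += 1


if __name__ == "__main__":
    bus = queue.Queue()
    publish(bus, "ingested", "/dfc/hydro/a.nc")
    index = set()
    print(run_indexer(bus, index), index)
```

    Neither side calls the other directly; each only needs to agree on the message format, which is what makes this kind of federation easy to connect.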

  • Three Federations in DFC

    [Diagram; recoverable labels only.]
    Eng, Mar, Main, Hydr, SoL, Soc, Bio
    THREDDS, SEAD, TerraPop, NASA/NCCS, NCDC, EC2, S3, ARTS/ORB
    DataBook, Vivo Indexer, HIVE Ontology 1, HIVE Ontology 2, Elastic Search, Format Indexer, SOLR, JENA, Data Verse, DataONE

  • Policies Govern the DFC

    [Diagram; recoverable labels only.]
    Groups: Standards Groups, International Projects, Advisory Committee, Science & Engineering Domains, Sustainability and Institutions, Facilities and Operations, Technology and Research, Education and Outreach, Policies and Standards

    • Policies for automating data management
    • Policies for publication & federation
    • Policies for IPR & citations
    • Policies for provenance & sustainability
    • Policies for collaboration and reuse
    • Policies for technology migration
    • Policies for metadata extraction
    • Policies for analysis and workflow
    • Policies for change management
    • Domain-centric policies
    • Policies for authentication & authorization
    • Policies for archiving, staging & caching
    • Policies for replication & synchronization
    • Policies for retention & disposition
    • Policies for deletion & redaction
    • Policies for trust
    • Policies for curation & preservation

  • How iPlant CI Enables Discovery: Overview of resources

    [Diagram; recoverable labels only.]
    End Users, Computational Users
    XSEDE
    Storage, Computation, Hosting, Web Services, Scalability

    Building a platform that can support diverse and constantly evolving needs.

  • Fig 3. iPlant Discovery Environment, showing the Data, App, and Analyses windows.

    Merchant N, Lyons E, Goff S, Vaughn M, Ware D, et al. (2016) The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences. PLOS Biology 14(1): e1002342. https://doi.org/10.1371/journal.pbio.1002342

  • How iPlant CI Enables Discovery: What iPlant data solutions mean for a bovine breeder

    “It's kind of like being in that COPD commercial where the weight is lifted off your chest, only in our case, we have access to more computational power, so we can get to projects much faster and we can do big projects that our machines may not have allowed us to do previously!

    The ability to transport 2TB of data overnight using the iRODS system was particularly helpful because previously, we had been mailing hard drives which is not an optimal solution to sharing big data.”

    James Koltes, Iowa State

  • Big Data of the Third Kind: DataBridge

    http://databridge.web.unc.edu/

    VISION: Data as a Citizen – Empower Data

    IDEAS:
    • Build Social Networks for Long-tail Science Data
    • Detect Communities through Multi-dimensional Socio-metric Analyses

  • Power law distribution

    [Plot labels:] Data in the Large, Data in the Small, Long Tail of Data

    • Dark Data – Hidden, Unknown
    • But has Presence and Value/worth
    • Voluminous, Variegated, Unknown Veracity
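
    To see why the tail can be "voluminous", a rough numeric sketch helps. It assumes, purely for illustration, that the size of the dataset at rank r is proportional to r raised to the power of minus the exponent; the dataset count, head size, and exponent below are made-up parameters, not measurements from the talk.

```python
def tail_share(n_datasets, head_count, exponent=1.0):
    """Fraction of total volume held outside the `head_count` largest datasets.

    Assumes a power law: the dataset at rank r has size r ** -exponent.
    """
    sizes = [rank ** -exponent for rank in range(1, n_datasets + 1)]
    return sum(sizes[head_count:]) / sum(sizes)


if __name__ == "__main__":
    # With these assumed parameters, the many small datasets together
    # outweigh the ten largest ones.
    print(round(tail_share(1000, 10), 2))
```

    With a steeper exponent the head dominates instead, so how much sits in the long tail depends on the actual distribution, which the slide does not quantify.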

  • Problems: Long-tail of Science Data

    First Mile Problem
    • How to make it available?
    • Where do I upload?
    • Who is in charge?
    • How do I get credit?
    • Can I control access?
    • How do I pool with other like-minded researchers? Community services?
    • How much is long-term?
    • Who pays for it?

    Last Mile Problem
    • How to make it findable?
    • What is needed to make it more visible? Metadata?
    • Are there other methods to make my data findable?
    • My data has specific ways & characteristics; how do I expose them as finding aids?
    • How can I find similar data?

    Solving the long-tail problem will also help the other two Big Data problems


  • Data Bridge: Design

    Construct multi-dimensional social networks for data. Three challenges:
    • Evaluate multiple types of “metrics” on data
      – Domain-specific, genre-specific, project-specific
      – Use Socio-metric Network Algorithms
      – Similar to – but for data
    • Find relevance
      – Slices of similarity
      – Explore Relationships between Data, Users, Resources, Methods, Workflows, …
      – Use Relevance Algorithms
    • Create communities
      – Use Clustering Algorithms
    • Provide an extensible & big data framework
      – Democratize the process
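
    The three design steps (a metric on data, a similarity filter, community detection) can be sketched end to end. The choices here are stand-ins chosen for brevity and are not stated in the slides: Jaccard overlap of keyword sets as the metric, a fixed threshold as the filter, and connected components in place of a real clustering algorithm. The dataset names and keywords are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(datasets, threshold):
    """Pairs of datasets whose similarity meets the threshold."""
    edges = []
    for x, y in combinations(sorted(datasets), 2):
        score = jaccard(datasets[x], datasets[y])
        if score >= threshold:
            edges.append((x, y, score))
    return edges


def communities(datasets, threshold):
    """Group datasets into connected components of the filtered network."""
    parent = {name: name for name in datasets}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for x, y, _ in similarity_edges(datasets, threshold):
        parent[find(x)] = find(y)

    groups = {}
    for name in datasets:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical long-tail datasets described by metadata keywords.
DATASETS = {
    "survey_2012": {"household", "income", "nc"},
    "survey_2014": {"household", "income", "nc", "health"},
    "stream_gauge": {"hydrology", "flow", "nc"},
    "rain_gauge": {"hydrology", "rainfall", "nc"},
}

if __name__ == "__main__":
    print(communities(DATASETS, threshold=0.5))
```

    Raising the threshold splits communities apart and lowering it merges them, which mirrors the "filter connectivity by similarity value" control in the screenshots.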

  • ScreenShot: Finding similarities

    Filter Connectivity by similarity value

    Select Network Data

  • ScreenShot: Weight of similarity

    Similarity measure: 0.5

  • ScreenShot: Highlights of similarities

    Link to the data

  • Data Bridge Architecture

  • Outline

    • Challenges in Big Data & Data Science
      – Scientific Data Explosion & Role of Data Science
    • Gearing to Meet the Challenges
      – Some Projects at UNC, Chapel Hill
    • Looking Towards the Future
      – Integration of Data, Computing & Networks

  • Two New (inter-related) Challenging Problems:

    • Smart Cities
    • Internet of Things

    Big Data of the Fourth Kind

  • What is a smart city?

    1. SMART: Urban platforms, Internet, (Big) data manipulation, sensing and metering
    2. SUSTAINABLE: Healthy, Pollution free, Balanced
    3. INCLUSIVE: Everyone has equal rights and possibilities, no poverty, socially and economically sustainable as well
    4. INNOVATIVE: Technologies, creativity, buildings, transport, etc.
    5. WELL PLANNED: Urban planning 2.0

    SMART CITY

    Smart City = Smart Information + Smart Infrastructure + Smart Communication

  • Internet of Things

    • The Internet of Things: “Internet of Objects”, “Machine-to-Machine Era”, “Internet of Everything”
    • “The Internet of Things, also called The Internet of Objects, refers to a wireless network between objects, usually the network will be wireless and self-configuring, such as household appliances.” – Wikipedia
    • “Internet of Things refers to the concept that the Internet is no longer just a global network for people to communicate with one another using computers, but it is also a platform for devices to communicate electronically with the world around them.” – Center for Data and Innovation

  • Our Project: Smart State NC

    • Promote Sustainable Social and Economic Developments across the State of North Carolina
    • Build bridges across Rural and Urban Communities
    • Apply Innovative Technologies
      – Big Data & Deep Learning
      – Data Sharing Urban Platforms
      – Social Media Networks
      – Internet of Things & Sensor Networks
      – Cloud Computing & Edge Computing
    • North Carolina as an Extreme Living Lab
      – Pilot Deployments across Communities

  • NSF SCC Proposal

    • Building a North Carolina Smart & Connected Communities Hub
      – Tackle the growing gap between urban and rural communities.
      – Adapt and adopt S&CC innovations in data analytics and technology from urban settings and effectively apply them in rural settings.
      – Use the State of North Carolina as a living laboratory and present a series of well-defined research experiments targeted at specific NC communities.
        • The goal of each living lab is to apply innovative strategies to solve rural problems and provide a sustainable resource for use by rural communities well beyond the endpoint of the proposed research.
      – Through the use of smart technological and social innovations, we hope to (re)connect small rural communities with each other and with urban areas across NC.
    • Multiple government agencies and five universities
    • https://smartcities.web.unc.edu/

  • S&CC Model of Connectedness: Multiple Dimensionalities (triangles), Each with Shared Disciplinary Components (circles)

    [Diagram; recoverable labels only.]
    Smart dimensions: Smart Analytics, Smart Infrastructure, Smart Air & Water, Smart Energy, Smart Environment, Smart Communities, Smart Living, Smart Citizens, Smart Health, Smart Mobility, Smart Economy, Smart Governance, Smart Policies, Smart Literacy
    Other labels: Hardware Platform, Software Tools, Data Networks; Rural, Urban, National
    Disciplinary components: Data & Information Science, Computer Science, Social Science
    ICT: Information & Communication Technologies
    IoT: Internet of Things (sensors)
    Advanced Data & Information Analytics
    Social Media & Networks

    S&CC Model

  • NC Smart & Connected Communities Hub: A Socio-Technical System Approach

    [Diagram; recoverable labels only.]
    Living Labs: Health, Food, Air & Water, Disaster Recovery, Energy, Pollution, Mobility
    NCAoT: North Carolina Array of Things Sensor Network
    PSMS: Place-based Social Media Services
    SCCConnect: Open Data Platform for Collaboration, Data & Apps
    Mitigate Literacy Gap (Accessibility & Literacy POPs)
    Community Engagement (Workshops & Meetings)
    Cross-cutting Themes: Enhanced Accessibility, Rural Revitalization & Sustainable Solutions

  • Players in NSF SCCI (proposed project)

    Community Engagement (Workshops & Town Hall Meetings) columns: Wake Forest [?], Sylva [?], Transylvania County [?], Kinston [?], Jordan Lake, Lake Howell, Pitt County [?], Spindale [?], Salisbury [?], Region A (SWC) [?], Region G (PTRC) [?], Region C (IPDC) [?]

    [Each X marks one engaged community; the column each X belongs to is not recoverable from this transcript.]

    Living Lab Areas for Rural Revitalization & Sustainable Solutions

    Group               Area                                  Communities  Academic Engagement*
    Smart Environments  Disaster Recovery                     X            ECU, UNC-CH
    Smart Environments  Energy                                X X          WCU, UNC-C
    Smart Environments  Food Production                       X            NCSU
    Smart Environments  Water Management                      X X          UNC-C, UNC-CH
    Smart Environments  Pollution Monitoring                  X X X        UNC-C, UNC-CH

    Common NCSCCHub Infrastructure

    Group               Area                                  Communities  Academic Engagement*
    Smart Policies      Open Data Platforms (SCCConnect)      X X X X      UNC-CH
    Smart Policies      Sensor Networks (NCAoT)               X X X X      UNC-C, UNC-CH
    Smart Citizens      Accessibility/Literacy PoPs           X X          UNC-CH
    Smart Citizens      Place-based Social Media Systems      X X X X      UNC-C

    *ECU = East Carolina University; NCSU = North Carolina State University; UNC-C = UNC – Charlotte; UNC-CH = UNC – Chapel Hill; WCU = Western Carolina University; SWC = South Western Commission; IPDC = Isothermal Planning and Development Commission; PTRC = Piedmont Triad Regional Council

  • S&CC Platform Framework

    [Diagram; recoverable labels only.]
    Central Hub
    Science Spokes: Food, Air & Water, Energy, Disaster Recovery, Health
    Region Spokes: Appalachia, Piedmont, Coastal Plains, Coast, Western Foothills
    Infrastructure Spokes: Sensor Apps & Data Analytics, Open Data Platform, Social Media Systems, Community PoPs
    Community Spokes: Governments, Citizens, Academics, Industries & Businesses, NGOs

    Fig 2. NC SCC Hub and Spoke System. The system is multi-tiered with four key hub-and-spokes dimensions. The central hub stitches together all the dimensions and acts as a coordination center for the whole initiative.

  • A Second Challenge: Data-Centric Collaborative Research

    Confluence of forces reshaping how data is leveraged
    • Data volumes, network limitations, requirements around security, privacy, regulations, new analysis methods that rely on unique hardware/software systems, cloud computing models, dizzying array of new tools

    Forces are misdirecting researchers from science
    • Provisioning storage, compute servers, and networking
    • Setting up and configuring data sharing and transfer tools
    • Provisioning and customizing complex tools on complex IT and High Performance Computing (HPC) systems
    • Determining how to secure data and IT systems
    • Figuring out how to publish data and tools for others to discover and use

    Data centric collaborations are requiring too many technologists
    • data scientists, informaticians, library scientists, computer scientists, HPC experts, IT vendors, and security/privacy experts

    Same challenges exist across scientific domains

  • Approach

    Provides an on-demand distributed, private workspace for sharing tools and data

    • Deep Indexing (DataBridge)
    • Cloud/HPC Computing (CyVerse Atmosphere)
    • Elastic Collaboration (RADII)
    • Federated Data (iRODS, DFC)
    • Secure Data Spaces
    • Visualization Interfaces

  • Conclusion

    • Challenges in Big Data & Data Science
      – Scientific Data Explosion & Role of Data Science
    • Gearing to Meet the Challenges
      – Some Projects at UNC, Chapel Hill
    • Looking Towards the Future
      – Integration of Data, Computing & Networks

  • Data Science & Big Data in the Future

    • Organization • Classification • Ontologies • Metadata • Retrieval • Management • Collection building • Analysis • Information seeking • Knowledge Representation • Human Computer Interaction • Social Skills • Data Citizenship • Information Behavior

    • Ethics • Privacy • Security • Information Technology • Transformation, Interpretation • Dissemination • Application • Reference Collections • Information Processing • Data Mining • Information Visualization • Information Network • Policy

    The List is the Same for Tomorrow – No Different than Today – But Changes to meet the 5Vs: New Methodology, New Way of Thinking, New Processing Paradigms, New Interactions
