Characterizing Data and Software for Social Science Research

SURVEY OF COMMONALITY WITH OTHER DISCIPLINES
WORKSHOP 2 – JULY 25, 2013
INDIANAPOLIS, INDIANA

MICAH ALTMAN
DIRECTOR OF RESEARCH, MIT LIBRARIES
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
[email protected]

PRIMARY RESEARCH OR PRACTICE AREA(S)
• INFORMATION SCIENCE
• SOCIAL SCIENCE

PREVIOUS EXPERIENCE
• DIGITAL LIBRARIES
• DIGITAL PRESERVATION
• STATISTICAL COMPUTING

RELATED WORK
• PUBLICMAPPING.ORG
• INFORMATICS.MIT.EDU

CONTACT INFORMATION
E25-131, 77 MASSACHUSETTS AVE, MIT, CAMBRIDGE, MA, 02139

description

This presentation describes the landscape of data and software use across the social sciences in terms of abstract dimensions of data and data use, then examines three use cases. Presented at the DASPOS workshop (https://daspos.crc.nd.edu/index.php/workshops/workshop-2) at JCDL 2013.

Transcript of Characterizing Data and Software for Social Science Research

Page 2: Characterizing Data and Software for Social Science Research

Prepared for
DASPOS Workshop
JCDL 2013

Characterizing Data and Software for Social Science Research

Dr. Micah Altman <[email protected]>

Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution

Page 3: Characterizing Data and Software for Social Science Research

Data and Software in Social Science Research

DISCLAIMER

These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!”

-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Page 4: Characterizing Data and Software for Social Science Research

Collaborators & Co-Conspirators

• Jonathan Crabtree, Nancy McGovern
• National Digital Stewardship Coordination Committee & Working Group Chairs
• Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.): http://privacytools.seas.harvard.edu/people
• Research Support
  – Supported in part by NSF grant CNS-1237235
  – Thanks to the Library of Congress, & the Massachusetts Institute of Technology.

Page 5: Characterizing Data and Software for Social Science Research

Related Work

• CoData Task Group on Data Citations, 2013 (Forthcoming). Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data, Co-Data Journal (Special Volume).
• Altman & Jackman, 2012, 19 Ways of Looking at Statistical Software, Journal of Statistical Software
• National Digital Stewardship Alliance, 2013, 2014 National Agenda for Digital Stewardship.
• Novak, K., Altman, M., Broch, E., Carroll, J. M., Clemins, P. J., Fournier, D., Laevart, C., et al. 201.. Communicating Science and Engineering Data in the Information Age. Computer Science and Telecommunications. National Academies Press
• Altman, M., Rogerson, K., & U, D. (2008). Open Research Questions on Information and Technology in Global and Domestic Politics – Beyond “E-.i, 41(4), 1-8. Retrieved from http://www.journals.cambridge.org/abstract_S104909650824093X
• Altman, Gill & McDonald. 2003. Numerical Issues in Statistical Computing for the Social Scientist

Most reprints available from: informatics.mit.edu

Page 6: Characterizing Data and Software for Social Science Research

This Talk

• Landscape (dimensions & attributes)

• Landmarks (sample use cases)

Page 7: Characterizing Data and Software for Social Science Research

Landscape: Characteristics of Social Science Research Data

Page 8: Characterizing Data and Software for Social Science Research

Some Characteristics of Research Data

Attribute Type: Examples

Data: Structure
- Single relation (table)
- Fully relational
- Network
- Geospatial
- Semi-structured (e.g. text)

Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal/nominal

Data: Performance Characteristics
- Number of observations
- Frequency of updates
- Dimensionality
- Sparsity
- Collection heterogeneity

Page 9: Characterizing Data and Software for Social Science Research

Some Characteristics of Research Measurements

Attribute Type: Examples

Measurement: Unit of Observation
- Individuals
- Groups
- Institutions
- Organizations
- Interactions

Measurement: Measurement type
- Experimental
- Observational
- Synthetic/computational

Measurement: Performance characteristic
- Metadata
- Ontology
- Quality

Page 10: Characterizing Data and Software for Social Science Research

Some Characteristics of Research Data Use

Attribute Type: Examples

Analysis methods
- Counting
- GLM model family
- MLE model family
- (Constrained) continuous nonlinear optimization
- Blind global optimization
- Discrete optimization
- Bayesian Methods (MCMC)
- Heuristically/algorithmically defined
- Text mining
- Clustering
- Coding and qualitative analysis
- Exploratory Data Analysis

Desired Outputs
- Summary scalars
- Summary table
- Data subset
- Static data publication
- Static visualization
- Dynamic Visualization

Page 11: Characterizing Data and Software for Social Science Research

Some Characteristics of Use Constraints

Diagram labels: Contract; Intellectual Property; Access Rights; Confidentiality

Items shown in the diagram:
- Copyright
- Fair Use
- DMCA
- Database Rights
- Moral Rights
- Intellectual Attribution
- Trade Secret
- Patent
- Trademark
- Common Rule (45 CFR 46)
- HIPAA
- FERPA
- EU Privacy Directive
- Privacy Torts (Invasion, Defamation)
- Rights of Publicity
- Sensitive but Unclassified
- Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …)
- Classified
- FOIA
- CIPSEA
- State Privacy Laws
- EAR
- State FOI Laws
- Journal Replication Requirements
- Funder Open Access
- Contract
- License
- Click-Wrap TOU
- ITAR
- Export Restrictions
- NDA

Page 12: Characterizing Data and Software for Social Science Research

Landmarks (Exemplar Use Cases)

Page 13: Characterizing Data and Software for Social Science Research

Exemplar: Policy Analysis

Attribute Type: Examples

Data: Structure
- Single relation (table)

Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal

Data: Performance Characteristics
- 10K-100K observations
- Monthly/annual updates
- Dozens of dimensions/measures

Measurement: Unit of Observation
- Individuals; Organizations; Institutions

Measurement: Measurement type
- Observational
- Repeated cross-sectional/longitudinal over decades

Measurement: Performance characteristic
- High quality measurements
- Systematic and complete metadata
- Controlled ontology
- Regular updates & long-term access

Management Constraints
- Confidentiality; Public Access

Analysis methods
- Counting (contingency tables); GLM Family

Desired Outputs
- Summary scalars
- Summary table
- Static visualization (map)

More Information

• Science and Engineering Indicators: http://www.nsf.gov/statistics/seind12/
• Details of NCSES use case: Novak et al. 2011
• Policy data producer perspectives: Journal of Official Statistics

Page 14: Characterizing Data and Software for Social Science Research

Exemplar: Media Anthropology Dissertation

Attribute Type: Examples

Data: Structure
- Audio/video
- GIS coverage / GPS trails
- Semi-structured field notes
- Coded qualitative and quantitative data

Data: Attribute Types
- Discrete
- Scale: ordinal/nominal

Data: Performance Characteristics
- 100s of observed units
- Longitudinal
- Dozens of dimensions/measures
- Static after publication

Measurement: Unit of Observation
- Individuals; Organizations; Physical environment

Measurement: Measurement type
- Observational; Interaction

Measurement: Performance characteristic
- High quality measurements
- Systematic and complete metadata
- Emergent coding/ontology

Management Constraints
- Confidentiality; social norms

Analysis methods
- Counting; Discourse; CAQDA (Qualitative)
- (Future) AI/Machine learning

Desired Outputs
- Book
- 1-2 hour video / interactive media synthesis

More Information

• Harvard media anthropology Ph.D. Program: sel.fas.harvard.edu/phd.html

Image Sources: Wikimedia Commons, Pixabay.com, Flickr

Page 15: Characterizing Data and Software for Social Science Research

Exemplar: Social Message Analysis

Attribute Type: Examples

Data: Structure
- Network

Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal/nominal

Data: Performance Characteristics
- 10M-1B observations
- Sample from stream of continuously updated corpus
- Dozens of dimensions/measures

Measurement: Unit of Observation
- Individuals; Interactions

Measurement: Measurement type
- Observational

Measurement: Performance characteristic
- High volume
- Complex network structure
- Sparsity
- Systematic and sparse metadata

Management Constraints
- License; Replication

Analysis methods
- Bespoke algorithms (clustering); nonlinear optimization; Bayesian methods

Desired Outputs
- Summary scalars (model coefficients)
- Summary table
- Static/interactive visualization

More Information

• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Molly Roberts. "How censorship in China allows government criticism but silences collective expression." APSA 2012 Annual Meeting Paper. 2012.
• Lazer, David, et al. "Life in the network: the coming age of computational social science." Science (New York, NY) 323.5915 (2009): 721.
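
Because the three exemplars are described along the same dimensions, they can be compared mechanically. The following is a minimal sketch and not part of the original talk: the class `UseCaseProfile`, its field names, and the helper `shared_values` are hypothetical, and the values are transcribed (and lightly normalized, e.g. "GIS coverage / GPS trails" becomes "geospatial") from the exemplar slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCaseProfile:
    """Hypothetical record for one exemplar; fields mirror slide dimensions."""
    name: str
    structures: frozenset
    units_of_observation: frozenset
    analysis_methods: frozenset
    constraints: frozenset


# Values transcribed and normalized from the three exemplar slides (assumption:
# the normalization of labels is ours, not the speaker's).
POLICY = UseCaseProfile(
    name="Policy Analysis",
    structures=frozenset({"single relation"}),
    units_of_observation=frozenset({"individuals", "organizations", "institutions"}),
    analysis_methods=frozenset({"counting", "GLM family"}),
    constraints=frozenset({"confidentiality", "public access"}),
)
ANTHROPOLOGY = UseCaseProfile(
    name="Media Anthropology Dissertation",
    structures=frozenset({"audio/video", "geospatial", "semi-structured", "coded data"}),
    units_of_observation=frozenset({"individuals", "organizations", "physical environment"}),
    analysis_methods=frozenset({"counting", "discourse", "CAQDA"}),
    constraints=frozenset({"confidentiality", "social norms"}),
)
SOCIAL = UseCaseProfile(
    name="Social Message Analysis",
    structures=frozenset({"network"}),
    units_of_observation=frozenset({"individuals", "interactions"}),
    analysis_methods=frozenset({"clustering", "nonlinear optimization", "Bayesian methods"}),
    constraints=frozenset({"license", "replication"}),
)


def shared_values(profiles, dimension):
    """Return the values of `dimension` common to every profile in `profiles`."""
    sets = [getattr(p, dimension) for p in profiles]
    return frozenset.intersection(*sets)


if __name__ == "__main__":
    profiles = [POLICY, ANTHROPOLOGY, SOCIAL]
    for dim in ("structures", "units_of_observation", "analysis_methods", "constraints"):
        print(dim, sorted(shared_values(profiles, dim)))
```

Under this encoding the only value common to all three exemplars is "individuals" as a unit of observation; structures, analysis methods, and constraints have no value shared by all three, which is one way to see the heterogeneity the exemplars are meant to illustrate.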

Page 16: Characterizing Data and Software for Social Science Research

Trends: More

More Types of Evidence

More Collaboration

More Data

More Publications, More Filters

More Learners

More Open

More Replication

Page 17: Characterizing Data and Software for Social Science Research

Some Challenges for Long-Term Replication/Access

• “Messy” human sensors
• Mix of data types, structures, sparsity
• Complex constraints: confidentiality, licensing, NDAs
• Manual/computer-assisted coding
• Niche commercial software (and private bespoke software) integral to analysis
• Very long-term longitudinal data/accessibility requirements

Page 18: Characterizing Data and Software for Social Science Research

Questions?

E-mail: [email protected]
micahaltman.com
Twitter: @drmaltman