Characterizing Data and Software for Social Science Research
SURVEY OF COMMONALITY WITH OTHER DISCIPLINES
WORKSHOP 2 – JULY 25, 2013
INDIANAPOLIS, INDIANA
MICAH ALTMAN
DIRECTOR OF RESEARCH, MIT LIBRARIES
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
[email protected]
PRIMARY RESEARCH OR PRACTICE AREA(S)
• INFORMATION SCIENCE
• SOCIAL SCIENCE
PREVIOUS EXPERIENCE
• DIGITAL LIBRARIES
• DIGITAL PRESERVATION
• STATISTICAL COMPUTING
RELATED WORK
• PUBLICMAPPING.ORG
• INFORMATICS.MIT.EDU
CONTACT INFORMATION
E25-131, 77 MASSACHUSETTS AVE, MIT, CAMBRIDGE, MA, 02139
Prepared for
DASPOS Workshop
JCDL 2013
Characterizing Data and Software for Social Science Research
Dr. Micah Altman <[email protected]>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution
Data and Software in Social Science Research
DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
• Jonathan Crabtree, Nancy McGovern
• National Digital Stewardship Coordination Committee & Working Group Chairs
• Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.) http://privacytools.seas.harvard.edu/people
• Research Support
– Supported in part by NSF grant CNS-1237235
– Thanks to the Library of Congress, & the Massachusetts Institute of Technology.
Related Work
• CoData Task Group on Data Citations, 2013 (Forthcoming). Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data, Co-Data Journal (Special Volume).
• Altman & Jackman, 2012, 19 Ways of Looking at Statistical Software, Journal of Statistical Software.
• National Digital Stewardship Alliance, 2013, 2014 National Agenda for Digital Stewardship.
• Novak, K., Altman, M., Broch, E., Carroll, J. M., Clemins, P. J., Fournier, D., Laevart, C., et al. 201.. Communicating Science and Engineering Data in the Information Age. Computer Science and Telecommunications. National Academies Press.
• Altman, M., Rogerson, K., & U, D. (2008). Open Research Questions on Information and Technology in Global and Domestic Politics – Beyond “E-.i, 41(4), 1-8. Retrieved from http://www.journals.cambridge.org/abstract_S104909650824093X
• Altman, Gill & McDonald. 2003. Numerical Issues in Statistical Computing for the Social Scientist.
Most reprints available from: informatics.mit.edu
This Talk
• Landscape (dimensions & attributes)
• Landmarks (sample use cases)
Landscape: Characteristics of Social Science Research Data
Some Characteristics of Research Data
Attribute Type: Examples
Data: Structure
- Single relation (table)
- Fully relational
- Network
- Geospatial
- Semi-structured (e.g. text)
Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal/nominal
Data: Performance Characteristics
- Number of observations
- Frequency of updates
- Dimensionality
- Sparsity
- Collection heterogeneity
Some Characteristics of Research Measurements
Attribute Type: Examples
Measurement: Unit of Observation
- Individuals
- Groups
- Institutions
- Organizations
- Interactions
Measurement: Measurement type
- Experimental
- Observational
- Synthetic/computational
Measurement: Performance characteristic
- Metadata
- Ontology
- Quality
Some Characteristics of Research Data Use
Attribute Type: Examples
Analysis methods
- Counting
- GLM model family
- MLE model family
- (Constrained) continuous nonlinear optimization
- Blind global optimization
- Discrete optimization
- Bayesian Methods (MCMC)
- Heuristically/algorithmically defined
- Text mining
- Clustering
- Coding and qualitative analysis
- Exploratory Data Analysis
Desired Outputs
- Summary scalars
- Summary table
- Data subset
- Static data publication
- Static visualization
- Dynamic visualization
Some Characteristics of Use Constraints
[Diagram: constraints on data use, grouped into four areas: Contract, Intellectual Property, Access Rights, Confidentiality. Labels shown:]
- Copyright
- Fair Use
- DMCA
- Database Rights
- Moral Rights
- Intellectual Attribution
- Trade Secret
- Patent
- Trademark
- Common Rule (45 CFR 46)
- HIPAA
- FERPA
- EU Privacy Directive
- Privacy Torts (Invasion, Defamation)
- Rights of Publicity
- Sensitive but Unclassified
- Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …)
- Classified
- FOIA
- CIPSEA
- State Privacy Laws
- EAR
- State FOI Laws
- Journal Replication Requirements
- Funder Open Access
- Contract
- License
- Click-Wrap TOU
- ITAR
- Export Restrictions
- NDA
Landmarks (Exemplar Use Cases)
Exemplar: Policy Analysis
Attribute Type: Examples
Data: Structure
- Single relation (table)
Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal
Data: Performance Characteristics
- 10K-100K observations
- Monthly/annual updates
- Dozens of dimensions/measures
Measurement: Unit of Observation
- Individuals; Organizations; Institutions
Measurement: Measurement type
- Observational
- Repeated cross-sectional/longitudinal over decades
Measurement: Performance characteristic
- High quality measurements
- Systematic and complete metadata
- Controlled ontology
- Regular updates & long-term access
Management Constraints
- Confidentiality; Public Access
Analysis methods
- Counting (contingency tables); GLM Family
Desired Outputs
- Summary scalars
- Summary table
- Static visualization (map)
More Information
• Science and Engineering Indicators: http://www.nsf.gov/statistics/seind12/
• Details of NCSES use case: Novak et al. 2011
• Policy data producer perspectives: Journal of Official Statistics
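One way to make the attribute types in the policy analysis exemplar concrete is to write the profile down as a record. The following is a minimal sketch only: the class name `UseCaseProfile`, its field names, and the helper `needs_restricted_access` are hypothetical and do not come from the talk; the values are copied from the policy analysis table.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set, Tuple


class Scale(Enum):
    """Measurement scales listed under 'Data: Attribute Types'."""
    RATIO = "ratio"
    INTERVAL = "interval"
    ORDINAL = "ordinal"
    NOMINAL = "nominal"


@dataclass
class UseCaseProfile:
    """Hypothetical record type; field names are illustrative assumptions."""
    name: str
    structures: List[str]
    scales: Set[Scale]
    observations: Tuple[int, int]  # (low, high) bounds on number of observations
    units_of_observation: List[str]
    measurement_types: List[str]
    constraints: List[str]
    analysis_methods: List[str]
    outputs: List[str] = field(default_factory=list)


# Values transcribed from the policy analysis exemplar.
policy_analysis = UseCaseProfile(
    name="Policy Analysis",
    structures=["single relation (table)"],
    scales={Scale.RATIO, Scale.INTERVAL, Scale.ORDINAL},
    observations=(10_000, 100_000),
    units_of_observation=["individuals", "organizations", "institutions"],
    measurement_types=["observational", "repeated cross-sectional/longitudinal"],
    constraints=["confidentiality", "public access"],
    analysis_methods=["counting (contingency tables)", "GLM family"],
    outputs=["summary scalars", "summary table", "static visualization (map)"],
)


def needs_restricted_access(profile: UseCaseProfile) -> bool:
    """Illustrative check: does the profile list a confidentiality constraint?"""
    return "confidentiality" in profile.constraints
```

Recording each exemplar in the same shape would let the use cases be compared attribute by attribute, for example to see which ones carry confidentiality constraints.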
Exemplar: Media Anthropology Dissertation
Attribute Type: Examples
Data: Structure
- Audio/video
- GIS coverage / GPS trails
- Semi-structured field notes
- Coded qualitative and quantitative data
Data: Attribute Types
- Discrete
- Scale: ordinal/nominal
Data: Performance Characteristics
- 100s of observed units
- Longitudinal
- Dozens of dimensions/measures
- Static after publication
Measurement: Unit of Observation
- Individuals; Organizations; Physical environment
Measurement: Measurement type
- Observational; Interaction
Measurement: Performance characteristic
- High quality measurements
- Systematic and complete metadata
- Emergent coding/ontology
Management Constraints
- Confidentiality; social norms
Analysis methods
- Counting; Discourse; CAQDA (Qualitative)
- (Future) AI/Machine learning
Desired Outputs
- Book
- 1-2 hour video / interactive media synthesis
More Information
• Harvard media anthropology Ph.D. Program: sel.fas.harvard.edu/phd.html
Image Sources: Wikimedia Commons, Pixabay.com, Flickr
Exemplar: Social Message Analysis
Attribute Type: Examples
Data: Structure
- Network
Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal/nominal
Data: Performance Characteristics
- 10M-1B observations
- Sample from stream of continuously updated corpus
- Dozens of dimensions/measures
Measurement: Unit of Observation
- Individuals; Interactions
Measurement: Measurement type
- Observational
Measurement: Performance characteristic
- High volume
- Complex network structure
- Sparsity
- Systematic and sparse metadata
Management Constraints
- License; Replication
Analysis methods
- Bespoke algorithms (clustering); nonlinear optimization; Bayesian methods
Desired Outputs
- Summary scalars (model coefficients)
- Summary table
- Static/interactive visualization
More Information
• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Molly Roberts. "How censorship in China allows government criticism but silences collective expression." APSA 2012 Annual Meeting Paper. 2012.
• Lazer, David, et al. "Life in the network: the coming age of computational social science." Science (New York, NY) 323.5915 (2009): 721.
Trends: More
• More Types of Evidence
• More Collaboration
• More Data
• More Publications, More Filters
• More Learners
• More Open
• More Replication
Some Challenges for Long-Term Replication/Access
• “Messy” human sensors
• Mix of data types, structures, sparsity
• Complex constraints: confidentiality, licensing, NDAs
• Manual/computer-assisted coding
• Niche commercial software (and private bespoke software) integral to analysis
• Very long term longitudinal data/accessibility requirements
Questions?
E-mail: [email protected]
Web: micahaltman.com
Twitter: @drmaltman