Post on 23-Oct-2021
WELCOME
Brian Harris-Kojetin, Committee on National Statistics, Director
Brian Moyer, Bureau of Economic Analysis, Director
Big Data Day Executive Sponsor
CNSTAT Panel
Bob Groves, Georgetown University, CNSTAT Chair
Sallie Keller, Virginia Tech University
Jerry Reiter, Duke University
Scanner Data in the Economic Research Service Consumer Food Data Program
Megan Sweitzer, USDA Economic Research Service
Big Data Day
May 11, 2018
Proprietary Scanner Data
• Consumer purchase transactions
• Used for marketing research
• Household purchase survey
• Retail point-of-sale (POS) data
– Purchase transaction records collected
from store POS systems of 60,000 stores
– Each year of data contains 6.5 billion records
• Used for research projects, program evaluations, regulatory
impact analyses, data products
Challenges
Accommodating size of data
• About 5 TB of data over 9 years
Extensive effort to clean, organize, and document
• Checking completeness and accuracy
• Data format and organization
• Understanding components and variables
Using data appropriately
• Designing suitable studies given properties of data
• Interpreting results in appropriate context
Understanding and Evaluating Data Quality
Documentation
Statistical properties
Coverage
Representativeness
Filling Gaps in the Data
Developing weights for the retail stores
Imputing missing prices in household data
Linking stores in the household and retail data
Acquiring new data from the vendor
• SNAP and WIC variables
• Less-restricted data
Creating Linkages with Other Datasets
USDA Nutrient and Food Composition Databases
• Linking foods as purchased to foods as consumed
• Linking food prices and dietary recall data
USDA’s National Household Food Acquisition and Purchase Survey (FoodAPS)
• Product identification
• Food environment
Thank you!
Megan Sweitzer
ERS Food Economics Division
megan.sweitzer@ers.usda.gov
https://www.ers.usda.gov/
“The Art of the Possible”
Census Enterprise Data Lake Proof-of-Concept
Big Data Day Presentation
Nitin Naik & Adley Kloth
Chief Technology Office
IT Directorate
May 11, 2018
Topics
Census Survey Operation As-Is Today
Proposed Solution
Proof-of-Concept
Census Survey Operation Future Target State
Census As-Is Today
Decentralized Data Management
• Multiple copies/instances
• Decentralized data stewardship with no Master Metadata Model at Directorate or
Enterprise level
• Difficult to share or link the data or even metadata between project or research teams
or Directorates.
Processing and Analytic Logic Constraints
• Processing data using coding intensive methods in SAS or Oracle, and in
numerous systems with questionable documentation
• Proper curation and storage of the data processing code is limited
• Reproducibility of results is very limited
Decentralized Data Storage
• Data not stored centrally for access.
• Most datasets are stored as SAS files or Oracle DBs
• Most data have file level access control
• Current data handling process non-scalable for administrative and 3rd party datasets
required to improve accuracy and quality of data products.
Decentralized security and privacy implementation
• Creating severe system inefficiencies.
• Cumbersome governance and security measures make tracking disclosure of Title 13
and Title 26 data difficult
• Limited Auditing and Usage monitoring
Technology Constraints
• Multiple copies stored on different servers/systems due to siloed technology
deployments based on project funding
• Inability to handle large datasets with complex calculations
• Lack of access to software and tools for deeper data analysis
Census Data Limitation
[Figure: the survey portfolio (Demographics, Decennial, Economics, Other Programs) laid out against time period. Each survey (Survey N, Survey N + 1, ...) maintains its own Security, Governance, Infrastructure Management, Data Management, and Analytics stack.]
[Figure: Enterprise Data Lake architecture.
Data sources: legacy data, application data, streaming data, RDBMS data, unstructured data (images, videos, BLOBs), big data stores, flat files (.csv, .txt), social media, external data (address, geo, weather), SAS datasets.
Data services: machine learning algorithms, hyper search, cognitive learning APIs, meta tagging, micro services, mass parallel processing, text analytics, in-memory analytics, artificial intelligence, cognitive computing.
Survey life cycle: data linkage, response collection, imputation and estimation, disclosure avoidance, tabulation, publication.
Consumers: production applications, analytics and reporting, internal user portal, mobile apps, Census website, dashboards, artificial intelligence systems, cognitive computing systems.]
Proposed Solution: Enterprise Data Lake in the Cloud
Security & Privacy
Enterprise Data Lake Proof-of-Concept (POC) Strategy
Team
• Econ
• Demo
• CTO
Problem
How can we support migration of legacy datasets, data treatment, and analytics SAS code to the Cloud and Big Data platforms?
Operation Mode
8-week effort with weekly sprints
Enterprise Data Lake POC Scope: What Does “Success” Look Like?
Build
• Data Lake Web User Interface
• Data access controls
• Pluggable Analytic pipelines (EMR, HortonWorks, etc.)
• “App store”
Run
• 2012 Commodity Flow Survey
• Survey of Income and Program Participation
• Legacy SAS Code
• Accuracy and Performance
Transfer
• Code Handoff
• Training and Documentation
• Roadmap
Enterprise Data Lake POC Execution Steps: How Did We Do It?
1. User initiates Analysis Routine based on selected data. Inputs passed on: 1) Analysis Routine, 2) Data File, 3) AD Group
2. A NodeJS Lambda function launches EMR/HDX via SDK/API
3. A Hadoop cluster is launched and bootstrapped to install Accumulo and Hive
4. Custom Java routine creates Accumulo rights and data tables and loads the data
5. Hive tables are created based on data visible to user in Accumulo
6. A SAS AMI is launched with Hive connection details
7. A SAS Program is run and results are stored in S3. The AWS instances and services are terminated. Outputs: 1) Location of Results, 2) Location of Logs
8. User Views Results
Automation using the same standard images for OS, HDP, Accumulo, Hive, SAS and R was used for CFS and SIPP
Enterprise Data Lake POC Outcomes
• Detailed research and recommendations for Hortonworks™ running in AWS
• Detailed research and recommendations for SAS™ running in AWS
• Demonstrate a web-accessible data lake that provides:
– User authentication and authorization
– Control of access roles and rights on a survey
– In-context launch of SAS and SQL that supports column level access control based on LDAP Group and one routine executed against one or more data files
• Replicate output of existing DEMOGRAPHIC and ECONOMIC data and procedures that:
– Matches results of existing analysis routines
– Completes faster than current capabilities
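The column level access control named in the outcomes can be illustrated with a small sketch. This is not the POC's implementation, which relied on Accumulo rights and Hive tables; the group names, column names, and the `COLUMN_GRANTS` mapping below are hypothetical and only show the idea of reducing a record to the columns a user's directory groups may read.

```python
# Hypothetical mapping from an LDAP/AD group to the survey columns it may read.
COLUMN_GRANTS = {
    "cfs_analysts": {"shipment_id", "commodity_code", "shipment_value"},
    "cfs_public": {"commodity_code"},
}


def visible_columns(user_groups):
    """Return the union of columns granted to any of the user's groups."""
    allowed = set()
    for group in user_groups:
        allowed |= COLUMN_GRANTS.get(group, set())
    return allowed


def filter_record(record, user_groups):
    """Return a copy of the record holding only the permitted columns."""
    allowed = visible_columns(user_groups)
    return {col: val for col, val in record.items() if col in allowed}


# Illustrative record; the values are made up.
record = {
    "shipment_id": 17,
    "commodity_code": "07",
    "shipment_value": 1200,
    "ein": "00-0000000",
}
print(filter_record(record, ["cfs_public"]))
```

A column that no group is granted (here `ein`) never reaches any user, which is the default-deny behavior one would usually want for Title 13 and Title 26 fields.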
Accomplishments
Census Future Target State
Centralized Data Management
• Only one instance of data
• Data lineage available
• Can share or link the data and/or metadata between project or research
teams and Directorates
Processing and Analytic Logic Constraints
• Processing data using coding intensive methods in SAS or Oracle in one
environment
• Data processing code is properly curated, stored, and fully visible
• Results are fully reproducible
Centralized Data Storage
• Data stored centrally for access with access control at file and
row/column/cell level
• Data handling scalable for administrative and 3rd party datasets.
Centralized security and privacy implementation
• Easier governance and security measures
• Easier tracking of disclosure of Title 13 and Title 26 data
• Expanded Auditing and Usage monitoring
Technology Capabilities
• Automated Provisioning of Security Certified Standardized PaaS
• Tight Integration of Storage and Compute for Faster Analysis
• Ability to handle large datasets with complex calculations
• Use of new tools (R, Python, Graph, Spark, Solr, Mahout, Palo)
• Cost Chargeback based on utilization
Census Future State: To-Be Vision
[Figure: an Enterprise Data Lake holding flat files, survey data, AdRecs, SAS datasets, RDBMS datasets, and analytical extracts, wrapped by Governance and Security. Standardized Cloud Services and Standardized Census Data Services (EDL Standard Services) deliver Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), and Data as a Service (DaaS) to Enterprise and Directorate Analytics.]
Theme: Any Data, Many Tools, Easier Operation = Better Analytics, Quality Products, Faster Time-to-Publish
Feedback and Q&A
Thank You!
RAAS SENSITIVE BUT UNCLASSIFIED
WORKING DRAFT – for research purposes only
Big Data Day – Recommendation Systems at the IRS
5/11/2018
| RESEARCH, APPLIED ANALYTICS AND STATISTICS (RAAS)
SENSITIVE BUT UNCLASSIFIED
May 2018
Challenges in form value verification
IRS processes millions of forms that require an assessment of accuracy of form values.
In an early stage of the process, experienced employees manually evaluate forms to flag potentially inaccurate values.
Challenges with Current Process
Time-consuming and costly
• > 300 employee-weeks away from regular duties
• Employees who perform this task are typically very experienced
Rushed
• High volume of forms requires reviewers to minimize time spent on each form
Inconsistent
• While standard operating procedures exist, adherence varies widely
• Furthermore, the manual process is complicated and unwieldy
Netflix Movie Rating Challenge
Recommendation algorithm research was spurred by the Netflix movie recommendation challenge
• Netflix initially had a limited selection of popular movies and wanted to promote lesser-known movies
• Sponsored a $1M competition to estimate customer movie ratings
– Dataset involved >17,000 movies for almost 500,000 users
– Required estimation in a sparse high-dimensional space
• Recommendation problem can be posed as a problem in sparse matrix estimation
• Competition won by a team employing a variety of algorithms combined in an ensemble approach
Example ratings matrix (movies: Avatar, Citizen Kane, Mean Girls, Jurassic Park)
User: known ratings; estimated ratings for the unrated movies
Stephanie: 2, 5; 3.1, 4.3
Andy: 3, 1; 3.5, 2.8
Ryan: 5, 4; 4.8, 4.1
Caitlin: 4; 2.1, 3.6, 3.2
Ashley: 2, 3; 4.2, 2.6
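To make the sparse matrix estimation framing concrete, here is a minimal sketch of rank-2 matrix factorization fitted by alternating least squares. It is one standard technique for this kind of problem, not the winning Netflix ensemble and not the IRS model. The placement of each known rating under a specific movie is an assumption made for illustration, and `estimate_ratings`, `reg`, and `n_iters` are names and settings chosen here.

```python
import random

USERS = ["Stephanie", "Andy", "Ryan", "Caitlin", "Ashley"]
MOVIES = ["Avatar", "Citizen Kane", "Mean Girls", "Jurassic Park"]

# Known ratings; which movie each rating belongs to is assumed for illustration.
KNOWN = {
    ("Stephanie", "Avatar"): 2, ("Stephanie", "Mean Girls"): 5,
    ("Andy", "Citizen Kane"): 3, ("Andy", "Jurassic Park"): 1,
    ("Ryan", "Avatar"): 5, ("Ryan", "Citizen Kane"): 4,
    ("Caitlin", "Mean Girls"): 4,
    ("Ashley", "Citizen Kane"): 2, ("Ashley", "Jurassic Park"): 3,
}


def _ridge_fit(vectors, targets, reg):
    """Solve min_w sum (t - w.v)^2 + reg*|w|^2 for 2-D vectors (Cramer's rule)."""
    a11 = a22 = reg
    a12 = b1 = b2 = 0.0
    for (x, y), t in zip(vectors, targets):
        a11 += x * x
        a12 += x * y
        a22 += y * y
        b1 += x * t
        b2 += y * t
    det = a11 * a22 - a12 * a12  # strictly positive because reg > 0
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def estimate_ratings(known, users, movies, reg=0.05, n_iters=300, seed=0):
    """Fill in every (user, movie) cell from a rank-2 factorization."""
    rng = random.Random(seed)
    user_vec = {u: (rng.uniform(0.1, 1.0), rng.uniform(0.1, 1.0)) for u in users}
    movie_vec = {m: (rng.uniform(0.1, 1.0), rng.uniform(0.1, 1.0)) for m in movies}
    for _ in range(n_iters):
        # Hold movie vectors fixed and refit each user on the movies they rated.
        for u in users:
            rated = [(m, r) for (uu, m), r in known.items() if uu == u]
            user_vec[u] = _ridge_fit(
                [movie_vec[m] for m, _ in rated], [r for _, r in rated], reg)
        # Hold user vectors fixed and refit each movie on the users who rated it.
        for m in movies:
            raters = [(u, r) for (u, mm), r in known.items() if mm == m]
            movie_vec[m] = _ridge_fit(
                [user_vec[u] for u, _ in raters], [r for _, r in raters], reg)
    return {
        (u, m): user_vec[u][0] * movie_vec[m][0] + user_vec[u][1] * movie_vec[m][1]
        for u in users for m in movies
    }


full = estimate_ratings(KNOWN, USERS, MOVIES)
for (user, movie), value in full.items():
    if (user, movie) not in KNOWN:
        print(f"{user:10s} {movie:14s} estimated {value:.1f}")
```

With only nine known ratings the estimates are weakly determined; the point is the mechanics: the known cells constrain low-dimensional user and movie vectors, and the unknown cells are read off their inner products.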
Recommendation system at the IRS
Identifying non-compliant returns and their component issues
• Sparse data algorithms forecast all line items on a form
• Flexible approach allows for scoring attachments – even rarely used forms or line items
• The differences between line item estimates and actuals can be aggregated across a single form to derive a form-level score
Populated line items across Form 1, Form 2, and Form 3, shown as actual value (forecast value):
Line item 1: 125,000 (120,000); 976,789 (890,341)
Line item 2: -98,761 (-94,081)
Line item 3: 10,000 (9,000); 200,000 (189,591)
Line item 4: 95,657 (10,000)
Line item 5: 67,932 (65,329)
Line item 6: 9,657 (8,492); 3,400 (4,021)
Line item 7: 45,000 (4,254)
Line item 8: 23,000 (24,912)
Line item 9: 89,000 (29,301)
Line item 10: 34,000 (75,025); 25,341 (27,124)
Line item 11: 9,521 (9,964)
Line item 12: 34,567 (39,231)
Form-level score (Form 1, Form 2, Form 3): 0.76, 0.39, 0.44
Legend: Actual value (forecast value); Possible inaccurate value
Line-item forecasts are obtained without using any training data
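The slide does not state how line-item differences are aggregated into a form-level score, so the following is only a sketch under an assumed rule: each line item contributes a relative gap between 0 and 1, the form score is the mean gap over populated line items, and items above a threshold are flagged. The functions `line_item_gap`, `form_score`, and `flag_items` and the 0.5 threshold are hypothetical.

```python
def line_item_gap(actual, forecast):
    """Relative gap in [0, 1] between a reported value and its forecast."""
    scale = max(abs(actual), abs(forecast))
    if scale == 0:
        return 0.0
    # Values of opposite sign can differ by more than the scale, so clamp at 1.
    return min(abs(actual - forecast) / scale, 1.0)


def form_score(form):
    """Mean gap over the populated line items of one form (assumed rule)."""
    if not form:
        return 0.0
    return sum(line_item_gap(a, f) for a, f in form.values()) / len(form)


def flag_items(form, threshold=0.5):
    """Line items whose gap exceeds the threshold, in input order."""
    return [name for name, (a, f) in form.items() if line_item_gap(a, f) > threshold]


# Two (actual, forecast) pairs taken from the example table above.
form = {
    "Line item 1": (125_000, 120_000),
    "Line item 4": (95_657, 10_000),
}
print(round(form_score(form), 2), flag_items(form))
```

Using a relative rather than absolute gap keeps a large-dollar line item from dominating the score, which is one reasonable design choice when line items span several orders of magnitude.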
An individual return’s line items are estimated using a baseline model trained on millions of forms
Training: Estimate baseline model using millions of concatenated forms
• Each record concatenates the Main Form with Attachment 1, Attachment 2, and Attachment 3
• 4.2MM forms, > 500 variables
• Baseline model: μ, Σ
Testing: Given an individual form, use model to forecast that form’s line items, and compare
Form line item    Actuals ($)    Forecast ($)    Corrections ($)
Line item 1       125,000        324,720         313,378
Line item 2
Line item 3       95,657         94,154          93,234
Line item 4       10,000         10,175          10,456
Line item 5
Line item 6       45,000         42,500          42,100
Line item 7
Line item 8
Line item 9       34,676         39,231          38,123
Performance is measured by comparison to actual corrections
On classifier-selected returns, the recommendation system outperforms classifiers in
issue identification
Source: XRDB2, 1120S MeF 1M Randomly Sampled Returns from TY13-15 subsetted to returns in CDE
The top model improves upon classifier performance on several fronts
[Figure: bar chart of percentage change relative to classifiers. On-Target: 63%; Capture: 21%; Misdirection: -43%.]
Metrics
• 63% increase in on-target predictions – the portion of anomalies flagged that resulted in a correction
• 21% increase in capture rate – the portion of all corrections made that were also identified in classification
• 43% decrease in misdirection – the number of flagged anomalies that did not result in a correction for every flagged anomaly that did result in a correction
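The three metrics can be written directly from their definitions. The sketch below assumes flagged anomalies and actual corrections are available as sets of identifiers; the function name `evaluate` and the example identifiers are illustrative, not taken from the IRS system.

```python
def evaluate(flagged, corrected):
    """Compute on-target, capture, and misdirection from two sets of ids."""
    hits = flagged & corrected      # flagged anomalies that led to a correction
    misses = flagged - corrected    # flagged anomalies that did not
    return {
        # Portion of flagged anomalies that resulted in a correction.
        "on_target": len(hits) / len(flagged) if flagged else 0.0,
        # Portion of all corrections that were also flagged.
        "capture": len(hits) / len(corrected) if corrected else 0.0,
        # Uncorrected flags per corrected flag.
        "misdirection": len(misses) / len(hits) if hits else float("inf"),
    }


# Illustrative ids: four line items flagged, three actually corrected.
metrics = evaluate(flagged={1, 2, 3, 4}, corrected={2, 3, 5})
print(metrics)
```

On-target and capture correspond to what is elsewhere called precision and recall, while misdirection is the ratio of false positives to true positives, so a lower value is better.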
— U.S. BUREAU OF LABOR STATISTICS • bls.gov
A Big Data Concept by the Bureau of Labor Statistics
MEASURING FOREIGN DIRECT INVESTMENT’S
IMPACT ON THE LABOR MARKET
Erik Friesenhahn, Economist, Bureau of Labor Statistics
May 11, 2018
Overview
Collaborative effort between the Bureau of Economic
Analysis (BEA) and the Bureau of Labor Statistics (BLS)
Add value to existing data
• BEA employment data
– national and state
– some industry detail
• BLS can publish:
– employment and wage data with greater industry and geographic detail
– occupational data
BLS last published limited FDI data in 1994
What data sources did we use?
BEA Foreign Direct Investment in the United
States
BLS Quarterly Census of Employment and
Wages (QCEW)
BLS Occupational Employment Statistics (OES)
Bureau of Economic Analysis (BEA)
2012 benchmark Survey of Foreign Direct Investment in the United States
• 10% or greater foreign ownership
• data collected based upon a company’s fiscal year
Affiliate level data
• often composed of many establishments
Publish data on:
• balance sheets, plant and equipment, income, value added, goods and services provided, employment
• no wage data
• no occupational data
Quarterly Census of Employment and
Wages (QCEW)
Nearly complete quarterly census of businesses
• 98% of all nonfarm employment
• both private and public
• unique in frequency and timeliness
Establishment level data
Over 200 variables:
• monthly employment
• quarterly wages
• industry classification
• company name
• address and phone number
Occupational Employment Statistics (OES)
Sample units drawn from the QCEW
• 1.2 million establishments
Classification system
• 23 major occupational groups
• over 800 detailed occupations
Publish occupational employment and wage data by:
• industry classification
• geographic detail
How are we creating our “big data” product?
1. Auto-match BEA data to the QCEW
• matching variable: Employer Identification Number (EIN)
• initial match done with computer algorithm
2. Analyst review of matches
• internet search
– verify auto-matched information
– locate additional subsidiaries
• longitudinal database
– company name
– address
– telephone number
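The two-step process above can be sketched as an exact join on EIN followed by a review queue for whatever the join misses. The record layouts (`name`, `eins`, `id`, `ein`) and the example values are hypothetical; the actual BEA and QCEW files and the matching algorithm are not described on the slide.

```python
from collections import defaultdict


def auto_match(affiliates, establishments):
    """Match affiliates to establishments on EIN; return matches and a review list."""
    by_ein = defaultdict(list)
    for est in establishments:
        by_ein[est["ein"]].append(est["id"])

    matched, needs_review = {}, []
    for aff in affiliates:
        # An affiliate may report several EINs; collect every establishment hit.
        ids = [est_id for ein in aff["eins"] for est_id in by_ein.get(ein, [])]
        if ids:
            matched[aff["name"]] = ids
        else:
            needs_review.append(aff["name"])  # goes to analyst review (step 2)
    return matched, needs_review


# Made-up records for illustration only.
affiliates = [
    {"name": "Affiliate A", "eins": ["11-1111111", "22-2222222"]},
    {"name": "Affiliate B", "eins": ["99-9999999"]},
]
establishments = [
    {"id": "est-1", "ein": "11-1111111"},
    {"id": "est-2", "ein": "22-2222222"},
    {"id": "est-3", "ein": "33-3333333"},
]
matched, needs_review = auto_match(affiliates, establishments)
print(matched, needs_review)
```

Even a matched affiliate may be incomplete if it uses EINs that neither agency holds, which is why matched records would still pass through analyst review in practice.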
A closer look at EINs
Firms can have more than one EIN
Firms may use different EINs for different
purposes
Neither BEA nor BLS collects full list of EINs
Incomplete matching of EINs will lead to
employment differences
What are some other factors that will cause differences between BEA and BLS?
Different people filling out the forms
Timing issues
• reference period
• seasonal fluctuations
• rapidly growing/contracting companies
All in/all out
• foreign ownership may be for only a specific establishment
Geographical scope
What data will BLS publish?
QCEW
• industry detail
– total private down to NAICS 4-digit
• geographic detail
– national, state, MSA and county
• country of ownership
OES
• major occupation groups (national and state)
• detailed occupations (national)
• ownership by world region
Contact Information
Erik Friesenhahn
Economist
Business Employment Dynamics
www.bls.gov/bdm
202-691-6557
friesenhahn.erik@bls.gov
www.eia.gov
U.S. Energy Information Administration
Independent Statistics & Analysis
Innovative Uses of Administrative and Third Party Data
For
ICSP Big Data Day
May 11, 2018 | Washington, D.C.
By
Nanda Srinivasan and Kevin Cooksey (Bureau of Labor Statistics)
Overview
• Description of EIA’s motor gasoline survey and petroleum marketing frame
and framing the context
• Description of BLS QCEW – framing the context, use of the data – 1.5
minute
• Challenges faced in drafting the MOU – description of safeguards on both
ends, including CIPSEA
• Alignment of management and drafting of MOU – Actions taken at both ends
• Results for EIA and BLS
• Closing thoughts
Nanda Srinivasan, Reston, VA
May 17, 2018
Nanda Srinivasan
May 11, 2018
Motivation
• Responsibility of federal statistical agencies to investigate alternative
sources of data to:
– Increase time and cost efficiencies
– Reduce respondent burden
• Larger context of research priorities for federal statistical agencies
– CNSTAT reports on Multiple Data Sources
– Commission on Evidence Based Policy Making
• EIA – Internal Statistical Methods Improvement Plan
• BLS – Provides good statistical use case for QCEW
Data Source: Motor Gasoline Price Survey (EIA-878)
• Weekly mandatory CIPSEA survey of approximately 800
retail gasoline stations across the country.
• “Gasoline price” definition: Cash price per gallon
(including taxes) as of 8:00 a.m. local time each Monday
– Regular, midgrade, and premium gasoline.
• Mode: Mostly CATI; however, other modes also available
• Same day data collection, processing, and dissemination
• Estimates are produced for 276 publication cells
– Nation, regions, 10 cities, 9 states
– Regular, midgrade, and premium
– Conventional and reformulated gasoline
EIA-863 – Petroleum Product Sales Identification Survey
• A triennial census of petroleum product volumes sold annually within the 50
states and DC.
– Products: No. 2 Distillate Fuel Oil/Diesel; No. 5 & 6 Residual Fuel; Motor Gasoline and
Gasohol; Propane.
– Used to benchmark and weight other EIA surveys.
– Has not been conducted in a number of years.
• Challenges
– Need to identify all unique businesses in a diversified product market. No longer able to use
SIC codes to identify firms by product sold.
– Need to identify the geography of a firm’s market area. Who’s selling what across what state
lines?
EIA-863 Multiple Data Source Solution
• U.S. Bureau of Labor Statistics Quarterly Census of Employment and
Wages (QCEW)
– Frame Covers 40 states with firms by NAICS.
– Most comprehensive source but has holes.
• State Energy Office Lists
– Available for a few states. Limited geographic coverage.
• Web Crawling/Scraping Data Grab
– EIA has engaged Idaho National Lab’s Big Data Team to develop a web crawling and
scraping process for the 10 states not in QCEW.
– Process uses a trained classifier with set key words to identify websites that are then
scraped for firm names and contact information.
Data Sharing Takes Patience and Time; Results are “Big”
• Cost savings
• Institutional issues
– State buy in for QCEW
– CIPSEA requirements
– Getting data to flow across two agencies
• Support from management
• BLS staff on Employment and Unemployment Statistics
– Dave Talon and Kevin Cooksey
• EIA staff
– Jeramiah Yeksavich and Maura Bardos
“Big Data” for Health Care Through Use of Electronic Health Records
(EHRs)
Carol DeFrances, Ph.D.
Chief, Ambulatory and Hospital Care Statistics Branch
Division of Health Care Statistics
National Center for Health Statistics
Big Data Day
May 11, 2018
Why Move the National Health Care Surveys to EHR Data?
• Less burden on the provider--no need for on-site medical record abstraction.
• More clinical detail and depth--all diagnoses, medications, and lab results are collected.
• Greater volume of data--all visits are included.
• Richer data available—allergies to medications, problem lists, family history, social history, and use of alcohol, tobacco, and substances.
• Better security--direct transmission of data with no need for laptops.
What Steps Were Taken to Move to EHR Data Collection?
Research
• Conducted several pilot studies sponsored by the Assistant Secretary for Planning and Evaluation, DHHS.
Data Standards
• Developed HL7 CDA Implementation Guide (IG) for the National Health Care Surveys, which provides a standardized format for data submission.
Survey Incentives
• Participation fulfills requirements of Medicare and Medicaid EHR Incentive Programs, a.k.a. Meaningful Use (MU).
• IG named in 2015 edition of Health IT Certification Criteria.
What are the Opportunities with EHR Data?
Greater Clinical Depth and Richness
• Collect clinical information objectively without need for medical record abstraction.
• Include all diagnoses, procedures, active problems, medications, laboratory tests, imaging, and results.
More Volume
• Include all inpatient and ambulatory visits, not just a sample.
• Collect rare conditions and experimental procedures.
Ability to Link Across Hospital Settings and to Other Data Sources
• Follow patients as they receive care throughout the sampled hospital—in the ED, as an inpatient, including ICU care, as well as any follow-up care received in outpatient clinics at the hospital.
• Link to the National Death Index (30-, 60-, and 90-day mortality).
• Link to Medicare and Medicaid Claims.
What are the Challenges with EHR Data?
Technological
• How and where to store the large volume of data collected.
• Dealing with interoperability issues.
Assessing Quality
• Need to conduct comparability studies.
Analytical
• How to integrate and harmonize: EHR data and abstracted data, and EHR data and administrative claims data.
• Should EHR data be edited?
Disclosure Concerns
• Are public use files still possible?
What is the Impact of EHR Data for the National Health Care Surveys?
“Big data” for health care
Innovative Research to Improve the Nation’s Health Care
A Dive into U.S. Expenditures on
Treatment by Disease, 2000-2014
Office of the Chief Economist Health Team, May 2018
Trend in Health and Non-health Personal Consumption Expenditures 1970-2016
8/17/2018
[Figure: health and non-health personal consumption expenditures by year, 1970-2016, in billions of dollars (axis range 0 to 12,000).]
Health: 21% of Consumption in 2016
Contribution of our work
Redefines output for the health sector to be the treatment of a condition
For example
– Output = number of patients treated for heart attacks
– Expenditures = spending on the treatment of heart attacks
– Price = average spending per treated patient for heart attacks
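A short worked example may help fix the three definitions. The dollar amounts and patient counts below are invented for illustration and are not BEA estimates; the helper names are likewise hypothetical. The sketch computes the price as spending per treated patient and shows that spending growth splits into price growth and growth in patients treated.

```python
def price_per_patient(expenditures, patients_treated):
    """Price of treating a condition = spending / number of patients treated."""
    if patients_treated <= 0:
        raise ValueError("patients_treated must be positive")
    return expenditures / patients_treated


def growth(new, old):
    """Simple growth rate between two periods."""
    return new / old - 1.0


# Hypothetical heart attack treatment data for two years.
spend_0, patients_0 = 10_000_000_000, 1_000_000
spend_1, patients_1 = 12_100_000_000, 1_100_000

price_0 = price_per_patient(spend_0, patients_0)
price_1 = price_per_patient(spend_1, patients_1)

spending_growth = growth(spend_1, spend_0)
price_growth = growth(price_1, price_0)
patient_growth = growth(patients_1, patients_0)

# Spending growth factors exactly into price growth and patient growth.
recombined = (1 + price_growth) * (1 + patient_growth) - 1
print(price_0, price_1, round(spending_growth, 4), round(recombined, 4))
```

Defining output as treated patients is what makes this split possible: a rise in spending can be attributed either to more people being treated or to each treatment costing more.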
Health Care Satellite Account
18 Aggregate Chapters
Health Care Satellite Account - Limitation
• Broad disease chapters limit applicability of the account
• “Nervous system” chapter includes:
– Migraines
– Multiple Sclerosis
18 Aggregate Chapters
Health Care Satellite Account - Next release planned for this summer includes additional detail
18 Aggregate Chapters
63 Component Categories
261 Detail Conditions
Additional detail at two levels of aggregation
Aggregate Chapter
Component Categories
Detail Conditions
Construction of Blended Account
• Use population weights from Medical Expenditure Panel Survey to fold in data from different sources
– Privately insured (millions): MarketScan®
– Medicare enrollees (millions): Medicare FFS 5% sample
– Other (e.g., uninsured, Medicaid): MEPS
Data comparison of spending per capita by condition
Growth Rates for Top-30 Conditions VS. All Others
Expenditures driven by technologies
Sovaldi (sofosbuvir), 2013
Expensive biologics
Conclusion
• Big data used to produce detailed condition-level estimates
• Our analysis of this data shows innovative treatments driving expenditures higher
• We hope the availability of more detailed condition-level estimates will lead to other new insights, questions, and future improvements in BEA estimates
Carol A. Robbins
ICSP Big Data Day
May 11, 2018
National Science Foundation
National Center for Science and Engineering Statistics
www.nsf.gov/statistics/
New Opportunities to Observe and Measure
Innovation, Modeling, Infrastructure, and Standards: What Can Big Data Tell Us About Open Source Software?
• NCSES
• Surveys on R&D, Innovation, S&E education and S&E workforce
• Congressionally Mandated Reports and other statistical products
• VT Social and Decision Analytics Lab: Data Science
• Collaboration with:
• Gizem Korkmaz, Stephanie Shipp, and Sallie Keller of VT SDAL
• Claire Kelling, Penn State University
• Improve Indicators of Research Outputs and Innovation Activities
• Open Source Software
• Public investment, limited output measures, potentially large
impact
• Intangible investment and an innovation activity
NCSES Collaboration with Virginia Tech
Software is Everywhere: Some of it is both Free and Customizable
• Server Software
• Operating Systems
• Statistical Software
Proprietary and Open Source
Opportunity to Harvest Data
Harvesting Open Source Software Data
Open Source Software Data Inventory
Sources: SourceForge, GitHub, depsy, Open Hub
Variable: sources marked; potential uses
Downloads: X X; measure potential impact
Ratings: X; measure impact and sentiment
Release and update date: X X; identify completed projects and current activity
Citations: X; measure impact
Reuse and dependencies: X; measure impact across projects
Type of software: X; identify product
Lines of code: X X; estimate effort
Contributor characteristics: X X X X; contributor network, team, experience, and sectors
Contribution level: X X X; estimate effort
Open Source Software: Measuring Value and Impact with R Packages
[Figure: R packages colored by type: statistical analysis packages (green); data wrangling, exploration, and visualization (pink); web-based data/API processing (turquoise); packages for matrix operations (blue); spatial analysis (orange).]
Downloads and Citations (match depsy to CRAN R list, n = 9,801)
Downloads per package: average for R 58,000; ggplot2 6,255,500
Citations per package: average for R 6.83; ggplot2 1,307
Next Step: Estimate Cost for R Packages
• Cost Components for packages
– Effort in person months: contributor experience and contribution level
– Wage equivalent: computer programmers, software developers (Occupational Employment Statistics, Bureau of Labor Statistics)
– Package cost = sum(total_person_year * wage_year)
• Industry Model: Constructive Cost Estimation
– Effort is a function of complexity and lines of code
– KLOC = kilo (thousands) of lines of code
• Compare with aggregate software investment measures
• Extend to other packages
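The cost formula above can be sketched as follows. The slide does not give the effort equation, so this sketch assumes the Basic COCOMO organic-mode form, effort = 2.4 × KLOC^1.05 person-months; that choice, the function names, the package sizes, and the annual wage are all illustrative assumptions rather than the authors' parameters.

```python
def effort_person_months(kloc, a=2.4, b=1.05):
    """Basic COCOMO (organic mode) effort; coefficients are an assumption here."""
    return a * kloc ** b


def package_cost(kloc, annual_wage):
    """Cost of one package = person-years of effort times the annual wage."""
    person_years = effort_person_months(kloc) / 12
    return person_years * annual_wage


def total_cost(packages, annual_wage):
    """Sum of package costs, mirroring sum(total_person_year * wage_year)."""
    return sum(package_cost(kloc, annual_wage) for kloc in packages.values())


# Hypothetical package sizes in KLOC and a hypothetical annual wage in dollars.
packages = {"pkg_small": 2.0, "pkg_medium": 10.0, "pkg_large": 50.0}
ANNUAL_WAGE = 100_000

for name, kloc in packages.items():
    print(name, round(effort_person_months(kloc), 1), round(package_cost(kloc, ANNUAL_WAGE)))
print("total", round(total_cost(packages, ANNUAL_WAGE)))
```

Because the exponent is above 1, estimated effort grows slightly faster than lines of code, so a handful of very large packages can dominate the aggregate cost.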
Share your feedback! https://www.surveymonkey.com/r/ICSPBDD