Day 1 - University of...
![Page 1: Day 1 - University of Kansascrmda.dept.ku.edu/resources/presentations/StatsCamp2015/Day_1_Stata.pdf · Stata Day 1 Jacob Fowles Assistant Professor School of Public Affairs and Administration](https://reader034.fdocuments.net/reader034/viewer/2022042107/5e8707b9b97a6e5a20378d58/html5/thumbnails/1.jpg)
8/13/2015
Stata Day 1
Jacob Fowles
Assistant Professor
School of Public Affairs and Administration
Affiliated Faculty, CRMDA
Why Stata?
• Price
» More expensive than R
» Less expensive than SPSS and SAS
• No modules or add-ons to purchase
• A full set of PDF manuals is included
• Robustness
» Pre-canned capabilities are extensive and fast
» Cross-platform compatibility (Windows, *nix, Mac). Licenses are not platform-specific. Cloud computing is a limitation
• Extendibility
» Stata interfaces easily with R, ODBC, as well as internet data repositories
» User-written ado files extend Stata’s capabilities (cutting edge vs. bleeding edge)
The "flavors" of Stata
• You can purchase Stata through KU’s GradPlan at a discounted rate. Which one to get?
» Small Stata (99 variables and 1200 observations per dataset): $49 per year
» Stata/IC (2047 variables, 798 covariates, no obs. limits): $179
» Stata/SE (32,767 variables, 10,998 covariates, no obs. limits): $395
» Stata/MP (multiprocessor version of SE): $845+
• Bottom line: if you are paying, buy Stata/IC unless you have a specific reason to spend more
The "quirks" of Stata
• Stata is powerful, but expects that you know what you are doing
» Upside: No annoying "are you sure you want to do this?" windows
» Downside: No annoying "are you sure you want to do this?" windows
• Stata will faithfully do what you ask of it and rarely question your expertise
• When you commit large errors, it will quietly do its best to correct them (often without alerting you)
• Stata still largely functions with the "one-dataset-at-a-time" approach—but Mata adds much flexibility
The Stata Interface
Data Editor
Variable Manager
Do-file Editor
Getting Help: help command
Getting Help: findit keyword
Other Sources of Help
• Google (Stata’s online help files are indexed, as are the slides from presentations given at Stata conferences)
• Statalist (http://www.statalist.org/)
• The Stata Journal (open access only)
• The Stata Blog (http://blog.stata.com)
• UCLA Academic Technology Services (http://www.ats.ucla.edu/stat/stata/)
• Stata Press publications:
» The Workflow of Data Analysis Using Stata by J. Scott Long
» An Introduction to Stata Programming by Christopher Baum
» Microeconometrics Using Stata by Cameron and Trivedi
Extending Stata’s Capabilities
• Stata’s pre-canned estimation commands are called ado-files (not to be confused with do-files)
• User-written ado-file packages can be located and added to expand Stata’s capabilities
• net install ado-package installs packages from Stata’s web repositories
• findit search_term searches online help, the Stata Journal, Stata FAQs, and other "approved" online Stata repositories
• ssc hot gives you a list of the most popular ado-file packages hosted by the SSC (Statistical Software Components) archive at Boston College
Keep Your Own ado-file Repository
• You can use the sysdir command to tell Stata where to install and look for ado-files that you download from the web:
.sysdir set PLUS "C:\my documents\ado\plus"
• You will have to issue this command every time Stata starts, so it is good practice to put this in your do-file template.
• Bonus tip: if you keep your ado-files in a Dropbox folder that syncs across all your computers, you will have access to them everywhere:
.sysdir set PLUS "D:\Dropbox\ado\plus"
Example: Installing estout
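The screenshot for this slide did not survive, so here is a minimal sketch of what installing and trying out estout might look like. It assumes an internet connection and uses the auto dataset that ships with Stata; the two regressions are only placeholders, not the models from the talk.

```stata
* Sketch only: requires internet access for the install step
ssc install estout, replace      // download estout from the SSC archive
which esttab                     // confirm that Stata can find the new ado-file

* Try it out on the example data shipped with Stata
sysuse auto, clear
eststo clear                     // drop any previously stored estimates
eststo: quietly regress price weight
eststo: quietly regress price weight mpg
esttab, se r2                    // side-by-side table with standard errors and R-squared
```

If you keep your own ado-file repository, issue sysdir set PLUS before ssc install so the package lands in that folder.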
The Importance of Reproducibility
• Stata has an efficient mechanism for keeping track of your actions: log files
• Get in the habit of always opening a log file before you do anything else
• Two ways to do this:
» File >> Log >> Begin
» log using "filename", text
• Log files make for good social science (and make your life easier)
• Choosing the text option makes your log files more accessible (you can open them directly in Word, notepad, etc.)
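A short sketch of the logging habit described above. The file name day1_session.log is made up for illustration, and the auto dataset stands in for your own data.

```stata
capture log close                            // close any log left open by an earlier run
log using "day1_session.log", text replace   // plain-text log; replace overwrites an old copy

sysuse auto, clear
summarize price mpg                          // everything shown here is written to the log

log close                                    // stop logging and save the file
```

Starting with capture log close means the do-file will not stop with an error when a log from a previous session is still open.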
Log File
Do-Files
• Do-files: "The Recipe"
» Allows you to save your commands so you don’t have to re-type them next time (or have to remember what you did last time)
» Can (should?) be used to record research notes
» Do Files can be opened three ways:
• File >> Do
• doedit
• Or using the tool bar
Handy Tips: Do-files
• A line beginning with an * is skipped when a do-file runs
• cd "c:\somewhere" sets the working directory, so you don’t have to make full path references to files
• File paths should be enclosed within double quotes (")
• clear all or cscript at the beginning of the do-file resets Stata to "like new" by clearing macros, data, etc. that may be hanging around
• set more off instructs Stata not to pause when the output window gets full
• set rmsg on instructs Stata to keep track of and display how long each command takes to complete (in seconds)
• If you insert a double slash (//) in a do-file, Stata treats everything after it on that line as a comment
• If you insert a triple slash (///) at the end of a line in a do-file, Stata treats the next line as though it were part of that line
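The tips above, gathered into a few lines you could paste into a do-file. This is only a sketch: the auto dataset is a stand-in, and cd is left out because the path depends on your machine.

```stata
* A line that starts with an asterisk is a comment and is skipped
clear all                      // start from a clean slate
set more off                   // do not pause when the Results window fills up
set rmsg on                    // report how long each command takes

sysuse auto, clear             // text after a double slash is a comment
summarize price mpg weight

* A triple slash continues the command on the next line
regress price mpg weight ///
    i.foreign, robust

set rmsg off
```

Both // and /// need a space in front of them, and they work in do-files rather than on the interactive command line.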
Do-files are Good Research Practice
Get in the habit of using do-files for everything you do in Stata. Your do-file(s) should:
1. Set up the Stata environment (log, directories, etc.)
2. Open your (raw/clean/untouched) data
3. Do whatever you need to do to the variables (recode, transform, clean, etc.)
4. Perform whatever analysis or statistical tests you want to run on the transformed dataset
5. Save the results as output somewhere, and (optionally) save your cleaned dataset as a new file
NEVER overwrite your raw/clean/untouched data file
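One way the five steps might look as a do-file skeleton. The file names analysis.log and auto_clean.dta are hypothetical, and the auto dataset plays the role of the raw data.

```stata
* 1. Set up the Stata environment
clear all
set more off
capture log close
log using "analysis.log", text replace

* 2. Open the raw data (never saved over below)
sysuse auto, clear

* 3. Transform and clean variables
gen log_price = ln(price)
label variable log_price "Log of price"

* 4. Run the analysis
regress log_price mpg weight i.foreign

* 5. Save the cleaned data under a NEW name, then close the log
save "auto_clean.dta", replace
log close
```

The replace option on save only ever touches auto_clean.dta, so the raw file stays as it was.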
Stata’s Do-file Editor
Commands are in blue
Data files are in red
Text and notes are in green
Variable names and command options are in black
Click here to run your do-file (or only the highlighted lines).
Handy Trick: When you Forget to Automate the Important
The log2do2 ado-file (ssc install log2do2) will reconstruct a do-file from a text format log-file
Efficiency in Research: The Power of Organization
• C:\documents\projects\project_1
» \datasets
• \raw
• \clean
» \do_files
» \log_files
» \manuscript
• \archived
• \tables_figures
» \literature
Data types
• Stata recognizes three unique types of data:
» Strings (alphanumeric characters), generally referenced in double quotes (")
• count if marital_status==married doesn’t work
• count if marital_status=="married" does
» Numeric (byte, int, long, float, double)
» Dates and times
• Missing data
» Strings: blank (represented as "" in the command line)
» Numeric: . (.a - .z are also recognized as missing)
» Missing numeric values are treated as infinitely large
• count if age>20 will count missing values
• count if age>20 & age<. won’t count missing values
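The same points can be tried on the auto dataset that ships with Stata (an assumption here, since the slides use other data). In that dataset make is a string variable and rep78 is a numeric variable with some missing values.

```stata
sysuse auto, clear

* Strings must be quoted
count if make == "AMC Concord"

* Missing numeric values count as larger than any number
count if rep78 > 3                 // includes observations where rep78 is missing
count if rep78 > 3 & rep78 < .     // excludes them
count if missing(rep78)            // how many are missing
```

The difference between the first two rep78 counts should equal the third.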
Data Types
String
Numeric
Numeric with assigned value labels
The Syntax of Stata Commands
All official Stata commands (and nearly all user-written ado-files) follow the same syntax:
.command name list of something (variables, files, etc.) if qualifying statement, command options
Some examples:
.use "C:\datasets\dataset1.dta", clear
.summarize F3ERN2011 if BYSEX==1, detail
.regress F3ERN2011 i.BYSEX, robust
You can use help command to view the specific syntax and options associated with a particular command
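For readers without the course dataset, here is the same pattern on the auto dataset that ships with Stata. The variables price and foreign come from that dataset, not from the slides.

```stata
sysuse auto, clear

* command   varlist   if qualifier      , options
summarize   price     if foreign == 1   , detail
regress     price i.foreign             , robust

help summarize        // opens the syntax diagram and option list
```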
Obtaining Data
• Stata is internet-savvy. It can download anything that you can access through a web browser. Combining this with do-files (and automating with loops!) can make your life much easier.
• copy http://www.somewhere.com/sheet1.xls "C:\project1\datasets\sheet1.xls", replace
• copy http://en.wikipedia.org/wiki/Justin_Bieber "C:\singers\canadian\beibs.htm"
• It can also unzip things:
• unzipfile "d:\archive.zip", replace
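A sketch of downloading with a loop. The base URL is an assumption: it is the Stata Press example-data site for Stata 18, and the three file names are datasets believed to be hosted there. Swap in your own URL and file list.

```stata
* Assumed location of example datasets; adjust the release number to your version
local base "https://www.stata-press.com/data/r18"

* Download each file into the current working directory
foreach f in auto lifeexp nlswork {
    copy "`base'/`f'.dta" "`f'.dta", replace
}

use "auto.dta", clear
describe, short
```

Local macros such as base only exist while the do-file runs, so run the whole block at once rather than line by line.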
Importing Data into Stata
• use
» Useful when the data are in Stata format and either:
• Accessible via a website (use http://whatever.dta)
• Already saved locally as a .dta file (use "c:\whatever.dta")
• insheet
» Useful when data are saved as a .csv (comma separated value) file
» Open a .csv file in Stata:
• insheet using "filename", comma
» This command also reads tab-delimited files
• insheet using "filename", tab
» And files delimited by other characters
• insheet using "filename", delimiter("char")
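A self-contained round trip you can run to see insheet at work. The file name auto_subset.csv is made up, and export delimited and import delimited assume Stata 13 or later.

```stata
* Write a small CSV file so there is something to import
sysuse auto, clear
keep make price mpg
export delimited using "auto_subset.csv", replace

* Read it back with insheet, as on the slide
insheet using "auto_subset.csv", comma clear
describe

* import delimited does the same job in Stata 13 and later
import delimited using "auto_subset.csv", clear
describe
```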
Importing Data into Stata from Excel
• Using the data editor
» Enter data "by hand"
» Copy and paste from Excel
• Open data in Excel, highlight entire sheet, Ctrl-C
• Open data editor in Stata (type edit on the command line), Ctrl-V
• import excel
» Allows for Excel files to be directly imported into Stata
» import excel "filename" , options
» This functionality has been expanded in Stata 13/14 (as has export excel "filename")
Combining Datasets
• Stata has commands for combining datasets:
• append
» stacks datasets, one on top of the other
» You can think of it as lengthening a sheet of paper
• merge
» combines datasets using a common unique identifier
» Allows 1-1, many-to-1 or many-to-many (don’t do M2M merges)
» You can think of it as "widening" a sheet of paper
• The syntax for these commands is a bit tricky. Get in the habit of checking the results, especially for merges (use the _merge variable)
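A small sketch of both commands, built from the auto dataset that ships with Stata so it runs anywhere. The pieces are saved as temporary files, so run the block in one go from a do-file.

```stata
* --- append: stack two sets of observations ---
sysuse auto, clear
keep if foreign == 0
tempfile domestic
save `domestic'

sysuse auto, clear
keep if foreign == 1
append using `domestic'        // "lengthens" the data
count

* --- merge: add variables using a unique identifier ---
sysuse auto, clear
keep make price
tempfile prices
save `prices'

sysuse auto, clear
keep make mpg weight
drop in 1/5                    // deliberately create some unmatched records
merge 1:1 make using `prices'  // "widens" the data
tab _merge                     // always check what matched and what did not
keep if _merge == 3            // keep only records found in both files
drop _merge
```

Here make identifies each car uniquely, which is what a 1:1 merge requires.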
Describing data
• Useful commands:
» describe (d)
» codebook
» summarize (sum)
» summarize, detail (sum, d)
» tab
» histogram
» scatter
» lfit
describe
. describe

Contains data from D:\Desktop\ELS_subset_20140724.dta
  obs:        16,132                          Written by R.
 vars:            27
 size:     2,548,856
-----------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
STU_ID          long    %9.0g                 STU_ID
F1DOB_P         long    %9.0g                 F1DOB_P
BYSEX           long    %9.0g      BYSEX      BYSEX
BYRACE          long    %9.0g      BYRACE     BYRACE
BYMOTHED        long    %9.0g      BYMOTHED   BYMOTHED
BYFATHED        long    %9.0g      BYFATHED   BYFATHED
BYGPARED        long    %9.0g      BYGPARED   BYGPARED
BYINCOME        long    %9.0g      BYINCOME   BYINCOME
BYSES1          double  %9.0g                 BYSES1

codebook
. codebook BYSEX

-----------------------------------------------------------------
BYSEX                                                       BYSEX
-----------------------------------------------------------------
                  type:  numeric (long)
                 label:  BYSEX

                 range:  [1,2]                 units:  1
         unique values:  2                 missing .:  766/16132

            tabulation:  Freq.   Numeric  Label
                          7651         1  Male
                          7715         2  Female
                           766         .

summarize
. sum BYRACE BYMOTHED BYFATHED BYINCOME

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      BYRACE |     15244    5.535489    1.906215          1          7
    BYMOTHED |     15318    3.721504      2.0123          1          8
    BYFATHED |     15301    3.868244    2.208072          1          8
    BYINCOME |     16132      9.0561    2.426787          1         13
sum variable, detail
. sum F3STLOANAMT, detail

                         F3STLOANAMT
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1000            350
 5%         2500            350
10%         4000            350       Obs                6859
25%        10000            350       Sum of Wgt.        6859

50%        20000                      Mean            32662.6
                        Largest       Std. Dev.      40299.61
75%        40000         300000
90%        75000         300000       Variance       1.62e+09
95%       100000         300000       Skewness       3.094367
99%       220000         300000       Kurtosis       15.67822

tabulate
. tab BYINCOME

         BYINCOME |      Freq.     Percent        Cum.
------------------+-----------------------------------
             None |         80        0.50        0.50
   $1,000 or less |        178        1.10        1.60
    $1,001-$5,000 |        304        1.88        3.48
   $5,001-$10,000 |        350        2.17        5.65
  $10,001-$15,000 |        697        4.32        9.97
  $15,001-$20,000 |        777        4.82       14.79
  $20,001-$25,000 |        997        6.18       20.97
  $25,001-$35,000 |      1,888       11.70       32.67
  $35,001-$50,000 |      3,012       18.67       51.35
  $50,001-$75,000 |      3,298       20.44       71.79
 $75,001-$100,000 |      2,166       13.43       85.22
$100,001-$200,000 |      1,804       11.18       96.40
 $200,001 or more |        581        3.60      100.00
------------------+-----------------------------------
            Total |     16,132      100.00

tabulate
. tab BYINCOME, nolab

   BYINCOME |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         80        0.50        0.50
          2 |        178        1.10        1.60
          3 |        304        1.88        3.48
          4 |        350        2.17        5.65
          5 |        697        4.32        9.97
          6 |        777        4.82       14.79
          7 |        997        6.18       20.97
          8 |      1,888       11.70       32.67
          9 |      3,012       18.67       51.35
         10 |      3,298       20.44       71.79
         11 |      2,166       13.43       85.22
         12 |      1,804       11.18       96.40
         13 |        581        3.60      100.00
------------+-----------------------------------
      Total |     16,132      100.00
tabulate
. tab BYRACE BYSEX

                      |         BYSEX
               BYRACE |      Male     Female |     Total
----------------------+----------------------+----------
Amer. Indian/Alaska N |        72         58 |       130
Asian, Hawaii/Pac. Is |       738        722 |     1,460
Black or African Amer |     1,004      1,016 |     2,020
Hispanic, no race spe |       498        498 |       996
Hispanic, race specif |       601        620 |     1,221
More than one race, n |       368        367 |       735
  White, non-Hispanic |     4,297      4,385 |     8,682
----------------------+----------------------+----------
                Total |     7,578      7,666 |    15,244
histogram
• histogram BYTXRSTD, normal
[Figure: histogram of BYTXRSTD with a normal density curve overlaid; x-axis: BYTXRSTD (20 to 80), y-axis: Density]
scatter and lfit
.graph twoway (scatter BYTXMSTD BYTXRSTD) (lfit BYTXMSTD BYTXRSTD)
[Figure: scatter plot of BYTXMSTD against BYTXRSTD with a fitted line; x-axis: BYTXRSTD (20 to 80); legend: BYTXMSTD, Fitted values]
Variable and Value Labels
• Renaming a variable: rename
» rename BYSEX gender
• Labeling a variable: label variable
» label variable gender "Gender of respondent"
• Labeling variable values
» Step 1: define a set of labels
• label define gender_labels 1 "male" 2 "female"
» Step 2: attach the set of labels to variable
• label values gender gender_labels
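The same three steps on the auto dataset that ships with Stata, for anyone following along without the course data. The names origin and origin_labels are made up for this example.

```stata
sysuse auto, clear

rename foreign origin                               // rename the variable
label variable origin "Origin of car"               // label the variable itself

label define origin_labels 0 "domestic" 1 "imported"   // Step 1: define a set of labels
label values origin origin_labels                      // Step 2: attach it to the variable

tab origin                 // shows the labels
tab origin, nolabel        // shows the underlying numeric codes
```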
Logical Operators
& and
| or
!= not equal
>= greater than or equal to
<= less than or equal to
== equals (note that = is a mathematical and not a logical operator)
Using logical operators
.tab BYFATHED if BYSEX==1
.tab BYFATHED if BYSEX!=2
.count if BYSES1==.
.sum F3ERN2011
.sum F3ERN2011 if BYMOTHED>=6
Creating and transforming data
• Useful commands:
» preserve and restore
» drop and keep
» destring and tostring
» encode and decode
» generate
» replace
» egen
preserve and restore
• preserve and restore take snapshots of your dataset, permitting you to revert and undo changes
• They must be used together: only one preserve is allowed at a time and it cannot be used again without a restore being issued first:
.preserve
.restore
.restore, not (this cancels the previous preserve without actually restoring)
drop and keep
• drop can be used to drop certain observations, observations based on a characteristic you specify, or a variable or list of variables:
.preserve
.drop BYINCOME BYSEX
.drop in 6
.keep if BYSEX!=.
.restore
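A runnable version of the sequence above on the auto dataset that ships with Stata (an assumption; the slides use other data). The count commands are added to show that restore really does undo the drops.

```stata
sysuse auto, clear
count                          // observations before any changes

preserve                       // take a snapshot
drop price mpg                 // drop two variables
drop in 6                      // drop the sixth observation
keep if rep78 != .             // keep only observations with rep78 present
count                          // fewer observations now
restore                        // go back to the snapshot

count                          // same as the first count
describe price mpg             // the dropped variables are back
```

Run the whole block at once: when a do-file ends, Stata restores a preserved dataset automatically.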
String to numeric (and vice versa)
• You can convert a string variable to a numeric (if it contains only numeric values) with destring varname
» destring varname, replace ignore(",") will replace the string variable, ignoring commas in otherwise numeric variables
» destring varname, replace force will code any cell containing non-numeric characters as missing
» destring varname, gen(newvar) will create a new numeric variable using the old string variable
• Conversely, tostring will convert a numeric variable to a string variable
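A tiny made-up dataset shows these options together. The variable names and the four values are invented for illustration.

```stata
clear
input str10 income_str
"1,200"
"3,450"
"n/a"
"980"
end

* ignore(",") strips the commas; force turns the non-numeric "n/a" into missing
destring income_str, gen(income) ignore(",") force
list

* And back again: numeric to string
tostring income, gen(income_txt)
describe
```

Use force with care, since it silently discards whatever it cannot convert.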
encode and decode
• encode creates a numeric variable from a string variable, and automatically creates value labels based on the string. decode does the opposite, creating a string variable from the value labels of a numeric variable.
.decode BYSEX, gen(BYSEX2)
.tab BYSEX2 BYSEX
.tab BYSEX2, nolabel
.encode BYSEX2, gen(BYSEX3)
generate
• generate (gen): Creates a new variable
.sum F3ERN2011, detail
.gen log_earn = ln(F3ERN2011)
.gen earn_sq = F3ERN2011*F3ERN2011
.gen blank_string = ""
.gen blank_num = .
• Standardize (z-score) a variable using generate:
.gen z_earn = (F3ERN2011-26009)/23993
.summarize z_earn
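Rather than typing the mean and standard deviation by hand, you can pull them from the results that summarize leaves behind. A sketch on the auto dataset that ships with Stata, with price standing in for earnings:

```stata
sysuse auto, clear

quietly summarize price                      // stores r(mean) and r(sd)
gen z_price = (price - r(mean)) / r(sd)      // z-score from the stored results

summarize z_price                            // mean close to 0, standard deviation close to 1
```

This avoids retyping numbers that change whenever the data do.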
String to date
• generate newvar = date(varname, "date format") will convert a string variable into Stata’s standardized format for dates (0 = Jan 1, 1960)
» "date format" reflects the formatting of the string
• 22 June 1979 = "DMY"
• June 22, 1979 = "MDY"
• format newvar %td will make the actual date display in the editor window, rather than the number representing the day
• Why bother with this?
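One answer, sketched with two made-up dates: once dates are stored as numbers, you can do arithmetic on them. The variable names are invented for this example.

```stata
clear
input str20 date_str
"22 June 1979"
"1 January 1960"
end

gen date_num = date(date_str, "DMY")     // days since 01jan1960
list                                     // date_num shows as a plain number

format date_num %td                      // display it as a date instead
list

* Date arithmetic: days between each date and 13 Aug 2015
gen days_elapsed = td(13aug2015) - date_num
list
```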
replace
• replace: Writes over a variable with a new value
.replace F3ERN2011=.
.replace F3ERN2011=. if F3ERN2011>=200000
.replace F3ERN2011 = . in 20
Using conditional operators
• Create a dummy variable using if:
.tab BYRACE
.gen hispanic_dum = .
.replace hispanic_dum=0 if BYRACE==1 | BYRACE==2 | BYRACE==3 | BYRACE==6 | BYRACE==7
.replace hispanic_dum = 1 if BYRACE==4 | BYRACE==5
.label var hispanic_dum "Respondent is Hispanic"
.tab hispanic_dum BYRACE
Using conditional operators
Create a dummy variable using cond:
.sum F3ERN2011, detail
.gen rich_dum = cond(F3ERN2011>=52500 & F3ERN2011<.,1,0)
But is this what we really wanted?
.tab rich_dum if F3ERN2011==.
Handy Trick: Missing Values
Stata treats missing numeric values as infinitely large for the purposes of transformations involving inequalities
Therefore, if we do this:
.gen rich_dum = cond(F3ERN2011>=52500 & F3ERN2011<.,1,0)
Any observations with missing values will be coded as a 0. So get in the habit of doing this instead:
.gen rich_dum = cond(F3ERN2011>=52500 & F3ERN2011<.,1,0) if F3ERN2011!=.
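To see the three possible outcomes side by side, here is a sketch on the auto dataset that ships with Stata, where rep78 has some missing values. The cutoff of 4 and the variable names are arbitrary.

```stata
sysuse auto, clear

gen naive_dum  = cond(rep78 >= 4, 1, 0)                      // missing rep78 coded 1
gen zeroed_dum = cond(rep78 >= 4 & rep78 < ., 1, 0)          // missing rep78 coded 0
gen safe_dum   = cond(rep78 >= 4, 1, 0) if !missing(rep78)   // missing rep78 stays missing

* Look only at the observations where rep78 is missing
list rep78 naive_dum zeroed_dum safe_dum if missing(rep78)
```

Only the last version keeps "unknown" separate from "no".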
egen
• egen: Extensions to generate (pre-canned generate functions for common data transformation)
• Standardize wage using egen:
.egen z_earn2 = std(F3ERN2011)
.summarize z_earn2
• Generate group-level means using egen:
.by BYSEX BYRACE, sort: egen mean_earn = mean(F3ERN2011)
.summarize F3ERN2011 if BYSEX==1 & BYRACE==7
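The same two egen functions on the auto dataset that ships with Stata, with price standing in for earnings and foreign for the grouping variable (both assumptions):

```stata
sysuse auto, clear

* Standardize a variable
egen z_price = std(price)
summarize z_price

* Group-level means, one value per group repeated on every row
bysort foreign: egen mean_price = mean(price)

* Check: the group mean should match summarize within that group
summarize price if foreign == 1
tabulate foreign, summarize(mean_price)
```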
Handy Tip: egenmore
The egenmore user-written ado-file (ssc install egenmore) extends the usefulness of egen by adding an extra 30ish pre-canned functions that can be called through egen. Some (randomly chosen) examples:
• rowmedian calculates, for each observation, the median across the variables specified
• noccur calculates the number of times a substring appears in a string variable
• xtile categorizes variables by percentiles or quantiles
Factor Notation and Regressions
• Stata recently (version 11, I think) introduced factor-variable notation for estimating models:
» The i. prefix designates that you want Stata to treat the variable as a set of indicator (dummy) variables, so you don't have to create them yourself! So:
.regress F3ERN2011 BYSEX BYRACE
treats gender and race as continuous variables, while
.regress F3ERN2011 i.BYSEX i.BYRACE
treats both variables as categorical, estimating a unique intercept for each category
» The c. prefix designates variables that should be treated as continuous (the default assumption if i. is not specified).
Factor Notation
• You can do some clever things:
ib#. allows you to set the omitted (base) category
.regress F3ERN2011 BYSEX ib7.BYRACE
ib(freq). selects the modal category as base (default is smallest category)
i#. creates a binary dummy from your categorical variable:
.regress F3ERN2011 BYSEX i7.BYRACE
Thereby comparing whites to non-whites (the omitted base category)
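These prefixes can be tried on the auto dataset that ships with Stata, using the five-category repair record rep78 in place of BYRACE (an assumption; the slides use the ELS data).

```stata
sysuse auto, clear

regress price rep78              // rep78 treated as continuous: one slope
regress price i.rep78            // rep78 treated as categorical: lowest category is the base
regress price ib3.rep78          // category 3 as the base
regress price ib(freq).rep78     // most frequent category as the base
regress price i5.rep78           // a single dummy: category 5 versus all others
```

Comparing the coefficient tables shows how the base category changes which comparisons are reported.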
Factor Notation
This is also quite handy for incorporating higher order terms and interactions in your model:
.regress F3ERN2011 i.BYSEX i.BYRACE c.BYTXRSTD##c.BYTXRSTD
This estimates a model predicting earnings including gender and race as categorical variables. It also includes the standardized reading test score (BYTXRSTD) and its squared term.
.regress F3ERN2011 i.BYSEX##c.BYTXMSTD i.BYRACE
This estimates a model predicting earnings including gender, race, and standardized math score. The interaction of gender and math score is also included, allowing the impact of math score on earnings to differ by gender.
Factor Notation
• Another benefit of using factor notation is that Stata is aware that quadratics are products of lower order terms and interaction terms are products of other variables. This makes calculating predicted values (and other interesting post-estimation statistics) trivial.
• Let's calculate the predicted differences in the earnings of a white, high scoring male vs. a white, high scoring female, using the old way and using factor notation
The Old Way
First, we create the interaction term:
.gen genderXmath = BYSEX*BYTXMSTD
Then, we include it in the regression:
.regress F3ERN2011 BYSEX BYTXMSTD genderXmath i.BYRACE
Now, we calculate our fitted value for women:
.margins, at(BYSEX=2 BYTXMSTD=63.14 BYRACE=7)
We get a fitted value of $20,341.63. And men:
.margins, at(BYSEX=1 BYTXMSTD=63.14 BYRACE=7)
We get a fitted value of $37,811.81.
These results say that men earn $17,470.18 more.
The Factor Notation Way
Run the regression:
.regress F3ERN2011 i.BYSEX##c.BYTXMSTD i.BYRACE
Now, we calculate our fitted value for women:
.margins, at(BYSEX=2 BYTXMSTD=63.14 BYRACE=7)
We get a fitted value of $30,880.54. And for men:
.margins, at(BYSEX=1 BYTXMSTD=63.14 BYRACE=7)
We get a fitted value of $34,498.93.
These results say that men earn $3,618.39 more, on average.
What accounts for this difference?
Remember your Calculus?
• The partial effect of gender on earnings in our equation isn't just the coefficient on gender. It is the coefficient on gender plus the product of the coefficient on the interaction term and the value we specify for math test score.
• Doing it the old way, Stata didn't know that the interaction term is the product of the two first-order terms. It just treated it like any other variable and held it constant when calculating the predicted values.
• Doing it the new way, Stata knew not to hold the interaction term constant, yielding the correct fitted values.
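A sketch of the same point on the auto dataset that ships with Stata, with foreign standing in for gender and weight for math score (both assumptions, as is the value 3000). With factor notation, the effect of foreign at a given weight is the coefficient on foreign plus the interaction coefficient times that weight, and margins reproduces this.

```stata
sysuse auto, clear

* The factor notation way
regress price i.foreign##c.weight
margins foreign, at(weight = 3000)           // fitted values for each group
margins, dydx(foreign) at(weight = 3000)     // difference between the groups

* The same difference by hand: b_foreign + b_interaction * 3000
display _b[1.foreign] + _b[1.foreign#c.weight] * 3000

* The old way: Stata does not know foreignXweight depends on foreign and weight
gen foreignXweight = foreign * weight
regress price foreign weight foreignXweight
margins, at(foreign = 1 weight = 3000)       // foreignXweight is NOT updated to 1*3000
```

The number printed by display should match the margins, dydx() result, while the old-way fitted value will generally differ from the factor-notation one.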