Day 1 - University of...


Page 1: Day 1 - University of Kansascrmda.dept.ku.edu/resources/presentations/StatsCamp2015/Day_1_Stata.pdf · Stata Day 1 Jacob Fowles Assistant Professor School of Public Affairs and Administration

8/13/2015

Stata Day 1

Jacob Fowles

Assistant Professor

School of Public Affairs and Administration

Affiliated Faculty, CRMDA

Why Stata?

• Price

» More expensive than R

» Less expensive than SPSS and SAS

• No modules or add-ons to purchase

• A full set of PDF manuals is included

• Robustness

» Pre-canned capabilities are extensive and fast

» Cross-platform compatibility (Windows, *nix, Mac). Licenses are not platform-specific. Cloud computing is a limitation

• Extensibility

» Stata interfaces easily with R, ODBC, as well as internet data repositories

» User-written ado files extend Stata’s capabilities (cutting edge vs. bleeding edge)

The "flavors" of Stata

• You can purchase Stata through KU’s GradPlan at a discounted rate. Which one to get?

» Small Stata (99 variables and 1200 observations per dataset): $49 per year

» Stata/IC (2047 variables, 798 covariates, no obs. limits): $179

» Stata/SE (32,767 variables, 10,998 covariates, no obs. limits): $395

» Stata/MP (multiprocessor version of SE): $845+

• Bottom line: if you are paying, buy Stata/IC unless you have a specific reason to spend more

The "quirks" of Stata

• Stata is powerful, but expects that you know what you are doing

» Upside: No annoying "are you sure you want to do this?" windows

» Downside: No annoying "are you sure you want to do this?" windows

• Stata will faithfully do what you ask of it and rarely question your expertise

• When you commit large errors, it will quietly do its best to correct them (often without alerting you)

• Stata still largely functions with the "one-dataset-at-a-time" approach—but Mata adds much flexibility

The Stata Interface

Data Editor

Variable Manager

Do-file Editor

Getting Help: help command

Getting Help: findit keyword

Other Sources of Help

• Google (Stata’s online help files are indexed, as are the slides from presentations given at Stata conferences)

• Statalist (http://www.statalist.org/)

• The Stata Journal (open access only)

• The Stata Blog (http://blog.stata.com)

• UCLA Academic Technology Services (http://www.ats.ucla.edu/stat/stata/)

• Stata Press publications:

» The Workflow of Data Analysis Using Stata by J. Scott Long

» An Introduction to Stata Programming by Christopher Baum

» Microeconometrics Using Stata by Cameron and Trivedi

Extending Stata’s Capabilities

• Stata’s pre-canned estimation commands are called ado-files (not to be confused with do-files)

• User-written ado-file packages can be located and added to expand Stata’s capabilities

• net install ado-package installs packages from Stata’s web repositories

• findit search_term searches online help, the Stata Journal, Stata FAQs, and other "approved" online Stata repositories

• ssc hot gives you the list of the most popular ado-files hosted by the SSC (Statistical Software Components) archive at Boston College

Keep Your Own ado-file Repository

• You can use the sysdir command to tell Stata where to install and look for ado-files that you download from the web:

.sysdir set PLUS "C:\my documents\ado\plus"

• You will have to issue this command every time Stata starts, so it is good practice to put this in your do-file template.

• Bonus tip: if you keep your ado-files in a Dropbox folder that syncs across all your computers, you will have access to them everywhere:

.sysdir set PLUS "D:\Dropbox\ado\plus"

Example: Installing estout

The Importance of Reproducibility

• Stata has an efficient mechanism for keeping track of your actions: log files

• Get in the habit of always opening a log file before you do anything else

• Two ways to do this:

» File >> Log >> Begin

» log using "filename", text

• Log files make for good social science (and make your life easier)

• Choosing the text option makes your log files more accessible (you can open them directly in Word, Notepad, etc.)
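A minimal sketch of the log-file habit, under a few assumptions: session1.log is a hypothetical file name, and sysuse auto loads an example dataset that ships with Stata and stands in for real project data. The capture prefix keeps the do-file from stopping if no log happens to be open.

```stata
* Close any log left open by an earlier run (no error if none is open)
capture log close

* Open a plain-text log; replace overwrites an older log of the same name
log using "session1.log", text replace

sysuse auto, clear        // example dataset shipped with Stata
summarize price mpg

log close
```

Everything typed between log using and log close, along with its output, ends up in session1.log in the working directory.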

Log File

Do-Files

• Do-files: "The Recipe"

» Allows you to save your commands so you don’t have to re-type them next time (or have to remember what you did last time)

» Can (should?) be used to record research notes

» Do-files can be opened three ways:

• File >> Do

• doedit

• Or using the tool bar

Handy Tips: Do-files

• A line beginning with an * is skipped when a do-file runs

• cd "c:\somewhere" sets the working directory, so you don’t have to make full path references to files

• File paths should be enclosed within double quotes (")

• clear all or cscript at the beginning of the do-file resets Stata to "like new" by clearing macros, data, etc. that may be hanging around

• set more off instructs Stata not to pause when the output window gets full

• set rmsg on instructs Stata to keep track of and display how long each command takes to complete (in seconds)

• If you insert a double slash (//) in a do-file, Stata treats everything after it on that line as a comment

• If you insert a triple slash (///) at the end of a line in a do-file, Stata treats the next line as a continuation of that line
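The comment and line-continuation tips can be sketched in a few lines. This assumes the code is run from a do-file (// and /// are not understood when typed at the command line) and uses the auto example dataset that ships with Stata in place of real data.

```stata
* A line starting with * is a comment and is skipped
clear all
set more off                 // do not pause when the Results window fills up

sysuse auto, clear           // example dataset shipped with Stata

* One long command split over three lines with ///
summarize price mpg weight ///
    if foreign == 1, ///
    detail
```

Stata reads the last three lines as the single command summarize price mpg weight if foreign == 1, detail.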

Do-files are Good Research Practice

Get in the habit of using do-files for everything you do in Stata. Your do-file(s) should:

1. Set up the Stata environment (log, directories, etc.)

2. Open your (raw/clean/untouched) data

3. Do whatever you need to do to the variables (recode, transform, clean, etc.)

4. Perform whatever analysis or statistical tests you want to run on the transformed dataset

5. Save the results as output somewhere, and (optionally) save your cleaned dataset as a new file

NEVER overwrite your raw/clean/untouched data file
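The five steps can be laid out as a do-file skeleton. This is only a sketch: project_log.log and auto_clean.dta are hypothetical file names, and the auto example dataset (shipped with Stata) plays the role of the raw data.

```stata
* 1. Set up the Stata environment
clear all
set more off
capture log close
log using "project_log.log", text replace    // hypothetical log name

* 2. Open the raw data (here: an example dataset shipped with Stata)
sysuse auto, clear

* 3. Transform the variables
generate log_price = ln(price)
label variable log_price "Log of price"

* 4. Run the analysis
regress log_price mpg weight i.foreign

* 5. Save the cleaned data under a NEW name, leaving the raw file untouched
save "auto_clean.dta", replace               // hypothetical output name
log close
```

Because the cleaned data are written to a different file, the raw data can always be rebuilt into the cleaned version by re-running the do-file.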

Stata’s Do-file Editor

Commands are in blue

Data files are in red

Text and notes are in green

Variable names and command options are in black

Click here to run your do-file (or only the highlighted lines).

Handy Trick: When you Forget to Automate the Important

The log2do2 ado-file (ssc install log2do2) will reconstruct a do-file from a text-format log file

Efficiency in Research: The Power of Organization

• C:\documents\projects\project_1

» \datasets

• \raw

• \clean

» \do_files

» \log_files

» \manuscript

• \archived

• \tables_figures

» \literature

Data types

• Stata recognizes three unique types of data:

» Strings (alphanumeric characters), generally referenced in double quotes (")

• count if marital_status==married doesn’t work

• count if marital_status=="married" does

» Numeric (byte, integer, long, float, double)

» Dates and times

• Missing data

» Strings: blank (represented as "" in the command line)

» Numeric: . (.a - .z are also recognized as missing)

» Missing numeric values are treated as infinitely large

• count if age>20 will count missing values

• count if age>20 & age<. won’t count missing values
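The missing-value behavior is easy to check on a tiny made-up dataset. The four ages below are invented for illustration only.

```stata
clear
input age
18
25
40
.
end

count if age > 20              // 3: ages 25 and 40, plus the missing value
count if age > 20 & age < .    // 2: the missing value is excluded
count if missing(age)          // 1: the missing value itself
```

The missing() function is an equivalent and arguably more readable way to test for missing values than comparing against the dot.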

Data Types

String

Numeric

Numeric with assigned value labels

The Syntax of Stata Commands

All official Stata commands (and nearly all user-written ado-files) follow the same syntax:

.command name list of something (variables, files, etc.) if qualifying statement, command options

Some examples:

.use "C:\datasets\dataset1.dta", clear

.summarize F3ERN2011 if BYSEX==1, detail

.regress F3ERN2011 i.BYSEX, robust

You can use help command to view the specific syntax and options associated with a particular command
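To see how the pieces of the general syntax line up, here is a sketch that runs on the auto example dataset shipped with Stata (the ELS variables used in the slides are not assumed to be available).

```stata
sysuse auto, clear

* command    varlist   if qualifier       , options
  summarize  price     if foreign == 1    , detail

* command    depvar    indepvars          , options
  regress    price     i.foreign mpg      , robust
```

The extra spaces are only there to show the alignment; Stata ignores them.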

Obtaining Data

• Stata is internet-savvy. It can download anything that you can access through a web browser. Combining this with do-files (and automating with loops!) can make your life much easier.

• copy http://www.somewhere.com/sheet1.xls "C:\project1\datasets\sheet1.xls", replace

• copy http://en.wikipedia.org/wiki/Justin_Bieber "C:\singers\canadian\beibs.htm"

• It can also unzip things:

• unzipfile "d:\archive.zip", replace

Importing Data into Stata

• use

» Useful when the data are in Stata format and either:

• Accessible via a website (use http://whatever.dta)

• Already saved locally as a .dta file (use "c:\whatever.dta")

• insheet

» Useful when data are saved as a .csv (comma separated value) file

» Open a .csv file in Stata:

• insheet using "filename", comma

» This command also reads tab-delimited files

• insheet using "filename", tab

» And files delimited by other characters

• insheet using "filename", delimiter("char")
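In Stata 13 and later, insheet still runs but has been superseded by import delimited. The sketch below is a round trip that writes a small .csv and reads it back; auto_small.csv is a hypothetical file name and the auto dataset is an example dataset shipped with Stata.

```stata
sysuse auto, clear
keep make price mpg

* Write a comma-separated file to the working directory
export delimited using "auto_small.csv", replace

* Read it back; varnames(1) takes the variable names from the first row
import delimited using "auto_small.csv", varnames(1) clear
describe
```

For other separators, import delimited accepts a delimiters("char") option, much like the delimiter() option of insheet.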

Importing Data into Stata from Excel

• Using the data editor

» Enter data "by hand"

» Copy and paste from Excel

• Open data in Excel, highlight entire sheet, Ctrl-C

• Open data editor in Stata (type edit on the command line), Ctrl-V

• import excel

» Allows for Excel files to be directly imported into Stata

» import excel "filename", options

» This functionality has been expanded in Stata 13/14 (and there is export excel "filename", too)

Combining Datasets

• Stata has commands for combining datasets:

• append

» stacks datasets, one on top of the other

» You can think of it as lengthening a sheet of paper

• merge

» combines datasets using a common unique identifier

» Allows 1-1, many-to-1 or many-to-many (don’t do M2M merges)

» You can think of it as "widening" a sheet of paper

• The syntax for these commands is a bit tricky. Get in the habit of checking the results, especially for merges (use the _merge variable)
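A small many-to-one merge shows what checking _merge looks like in practice. Both datasets are invented for illustration (student_id, school_id, score, and enrollment are hypothetical variable names), and the tempfile macros only work when the whole block is run together from one do-file.

```stata
* Student-level file: several students can share a school
clear
input student_id school_id score
1 10 55
2 10 61
3 20 47
4 30 70
end
tempfile students
save `students'

* School-level file: one row per school
clear
input school_id enrollment
10 500
20 350
40 900
end
tempfile schools
save `schools'

* Many students to one school
use `students', clear
merge m:1 school_id using `schools'

* _merge: 1 = master only, 2 = using only, 3 = matched
tabulate _merge
```

Here three students match a school, student 4 (school 30) has no school record, and school 40 has no students, so the tabulation immediately flags the two unmatched rows.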

Describing data

• Useful commands:

» describe (d)

» codebook

» summarize (sum)

» summarize, detail (sum, d)

» tab

» histogram

» scatter

» lfit

describe

. describe

Contains data from D:\Desktop\ELS_subset_20140724.dta
  obs:        16,132                          Written by R.
 vars:            27
 size:     2,548,856
----------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------
STU_ID          long    %9.0g                 STU_ID
F1DOB_P         long    %9.0g                 F1DOB_P
BYSEX           long    %9.0g      BYSEX      BYSEX
BYRACE          long    %9.0g      BYRACE     BYRACE
BYMOTHED        long    %9.0g      BYMOTHED   BYMOTHED
BYFATHED        long    %9.0g      BYFATHED   BYFATHED
BYGPARED        long    %9.0g      BYGPARED   BYGPARED
BYINCOME        long    %9.0g      BYINCOME   BYINCOME
BYSES1          double  %9.0g                 BYSES1

codebook

. codebook BYSEX

-------------------------------------------------------------------------
BYSEX                                                               BYSEX
-------------------------------------------------------------------------

                  type:  numeric (long)
                 label:  BYSEX

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  766/16132

            tabulation:  Freq.   Numeric  Label
                          7651         1  Male
                          7715         2  Female
                           766         .

summarize

. sum BYRACE BYMOTHED BYFATHED BYINCOME

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      BYRACE |     15244    5.535489    1.906215          1          7
    BYMOTHED |     15318    3.721504      2.0123          1          8
    BYFATHED |     15301    3.868244    2.208072          1          8
    BYINCOME |     16132      9.0561    2.426787          1         13

sum variable, detail

. sum F3STLOANAMT, detail

                        F3STLOANAMT
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1000            350
 5%         2500            350
10%         4000            350       Obs                6859
25%        10000            350       Sum of Wgt.        6859

50%        20000                      Mean            32662.6
                        Largest       Std. Dev.      40299.61
75%        40000         300000
90%        75000         300000       Variance       1.62e+09
95%       100000         300000       Skewness       3.094367
99%       220000         300000       Kurtosis       15.67822

tabulate

. tab BYINCOME

         BYINCOME |      Freq.     Percent        Cum.
------------------+-----------------------------------
             None |         80        0.50        0.50
   $1,000 or less |        178        1.10        1.60
    $1,001-$5,000 |        304        1.88        3.48
   $5,001-$10,000 |        350        2.17        5.65
  $10,001-$15,000 |        697        4.32        9.97
  $15,001-$20,000 |        777        4.82       14.79
  $20,001-$25,000 |        997        6.18       20.97
  $25,001-$35,000 |      1,888       11.70       32.67
  $35,001-$50,000 |      3,012       18.67       51.35
  $50,001-$75,000 |      3,298       20.44       71.79
 $75,001-$100,000 |      2,166       13.43       85.22
$100,001-$200,000 |      1,804       11.18       96.40
 $200,001 or more |        581        3.60      100.00
------------------+-----------------------------------
            Total |     16,132      100.00

tabulate

. tab BYINCOME, nolab

   BYINCOME |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         80        0.50        0.50
          2 |        178        1.10        1.60
          3 |        304        1.88        3.48
          4 |        350        2.17        5.65
          5 |        697        4.32        9.97
          6 |        777        4.82       14.79
          7 |        997        6.18       20.97
          8 |      1,888       11.70       32.67
          9 |      3,012       18.67       51.35
         10 |      3,298       20.44       71.79
         11 |      2,166       13.43       85.22
         12 |      1,804       11.18       96.40
         13 |        581        3.60      100.00
------------+-----------------------------------
      Total |     16,132      100.00

tabulate

. tab BYRACE BYSEX

                      |         BYSEX
               BYRACE |      Male     Female |     Total
----------------------+----------------------+----------
Amer. Indian/Alaska N |        72         58 |       130
Asian, Hawaii/Pac. Is |       738        722 |     1,460
Black or African Amer |     1,004      1,016 |     2,020
Hispanic, no race spe |       498        498 |       996
Hispanic, race specif |       601        620 |     1,221
More than one race, n |       368        367 |       735
  White, non-Hispanic |     4,297      4,385 |     8,682
----------------------+----------------------+----------
                Total |     7,578      7,666 |    15,244

histogram

• histogram BYTXRSTD, normal

[Figure: histogram of BYTXRSTD with a normal density curve overlaid; x-axis: BYTXRSTD (20 to 80), y-axis: Density]

scatter and lfit

.graph twoway (scatter BYTXMSTD BYTXRSTD) (lfit BYTXMSTD BYTXRSTD)

[Figure: scatter plot of BYTXMSTD against BYTXRSTD with a linear fit line; legend: BYTXMSTD, Fitted values]

Variable and Value Labels

• Renaming a variable: rename

» rename BYSEX gender

• Labeling a variable: label variable

» label variable gender "Gender of respondent"

• Labeling variable values

» Step 1: define a set of labels

• label define gender_labels 1 "male" 2 "female"

» Step 2: attach the set of labels to the variable

• label values gender gender_labels
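The rename and labeling steps can be tried end to end on a made-up variable. The four BYSEX values below are invented so the sketch runs without the ELS file.

```stata
clear
input BYSEX
1
2
2
.
end

rename BYSEX gender
label variable gender "Gender of respondent"

* Step 1: define the label set; Step 2: attach it to the variable
label define gender_labels 1 "male" 2 "female"
label values gender gender_labels

tabulate gender              // shows "male" / "female"
tabulate gender, nolabel     // shows the underlying 1 / 2
```

The data themselves stay numeric; the value labels only change how the codes are displayed.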

Logical Operators

& and

| or

!= not equal

>= greater than or equal to

<= less than or equal to

== equals (note that = is a mathematical and not a logical operator)

Using logical operators

.tab BYFATHED if BYSEX==1

.tab BYFATHED if BYSEX!=2

.count if BYSES1==.

.sum F3ERN2011

.sum F3ERN2011 if BYMOTHED>=6

Creating and transforming data

• Useful commands:

» preserve and restore

» drop and keep

» destring and tostring

» encode and decode

» generate

» replace

» egen

preserve and restore

• preserve and restore take snapshots of your dataset, permitting you to revert and undo changes

• They must be used together: only one preserve is allowed at a time and it cannot be used again without a restore being issued first:

.preserve

.restore

.restore, not (this cancels the previous preserve without actually restoring)
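A short sketch of the snapshot-and-revert pattern, using the auto example dataset shipped with Stata (74 cars, 22 of them foreign) rather than the ELS data.

```stata
sysuse auto, clear

preserve                     // take a snapshot of all 74 observations
drop if foreign == 1         // destructive change: 52 observations remain
count
restore                      // back to the snapshot

count                        // 74 again
```

When preserve is issued inside a do-file, Stata also restores automatically when the do-file ends, unless restore, not was issued first.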

drop and keep

• drop can be used to drop certain observations, observations based on a characteristic you specify, or a variable or list of variables:

.preserve

.drop BYINCOME BYSEX

.drop in 6

.keep if BYSEX!=.

.restore

String to numeric (and vice versa)

• You can convert a string variable to a numeric (if it contains only numeric values) with destring

» destring varname, replace ignore(",") will replace the string variable, ignoring commas in otherwise numeric values

» destring varname, replace force will code any cell containing non-numeric characters as missing

» destring varname, gen(newvar) will create a new numeric variable using the old string variable

• Conversely, tostring will convert a numeric variable to a string variable
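The destring options can be combined. The sketch below uses an invented string variable (income_str and its four values are hypothetical) containing commas and one non-numeric entry.

```stata
clear
input str10 income_str
"1,200"
"350"
"n/a"
"4,075"
end

* Strip commas, turn anything still non-numeric into missing,
* and keep the original string variable alongside the new one
destring income_str, generate(income_num) ignore(",") force
list

* And back again: numeric to string
tostring income_num, generate(income_back)
```

Using generate() instead of replace keeps the original strings, which makes it easy to inspect exactly which cells force turned into missing values.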

encode and decode

• encode creates a numeric variable from a string variable, and automatically creates value labels based on the string. decode does the opposite, using the value labels as the variable values.

.decode BYSEX, gen(BYSEX2)

.tab BYSEX2 BYSEX

.tab BYSEX2, nolabel

.encode BYSEX2, gen(BYSEX3)
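The same round trip can be run on the auto example dataset shipped with Stata, where foreign is coded 0/1 with value labels. The new variable names are hypothetical.

```stata
sysuse auto, clear

decode foreign, generate(foreign_str)    // string: "Domestic" or "Foreign"
encode foreign_str, generate(foreign2)   // numeric again, with a new value label

tabulate foreign foreign2, nolabel
```

Note that encode numbers the categories 1, 2, ... in alphabetical order, so the round trip does not necessarily give back the original codes: foreign is 0/1 but foreign2 is 1/2.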

generate

• generate (gen): Creates a new variable

.sum F3ERN2011, detail

.gen log_earn = ln(F3ERN2011)

.gen earn_sq = F3ERN2011*F3ERN2011

.gen blank_string = ""

.gen blank_num = .

• Standardize (z-score) a variable using generate:

.gen z_earn = (F3ERN2011-26009)/23993

.summarize z_earn
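Typing the mean and standard deviation by hand works, but the values that summarize leaves behind in r(mean) and r(sd) can be used directly. This is a sketch on the auto example dataset shipped with Stata, not the ELS earnings variable.

```stata
sysuse auto, clear

summarize price
generate z_price = (price - r(mean)) / r(sd)

summarize z_price            // mean close to 0, standard deviation close to 1
```

Using the stored results avoids transcription and rounding errors, and the do-file keeps working if the data change.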

String to date

• generate newvar = date(varname, "date format") will convert a string variable into Stata’s standardized format for dates (0 = Jan 1, 1960)

» "date format" reflects the formatting of the string

• 22 June 1979 = "DMY"

• June 22, 1979 = "MDY"

• format newvar %td will make the actual date display in the editor window, rather than the number representing the day

• Why bother with this?
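One answer is that numeric dates can be sorted and subtracted, which strings cannot. A minimal sketch with two invented date strings (dob_str and the derived variable names are hypothetical):

```stata
clear
input str20 dob_str
"22 June 1979"
"1 January 1960"
end

generate dob = date(dob_str, "DMY")   // days since 01jan1960
format dob %td                        // display as a date, store as a number
list

* Date arithmetic: days between each date and 1 January 2000
generate days_to_2000 = mdy(1, 1, 2000) - dob
list dob days_to_2000
```

The second row converts to 0, since 1 January 1960 is the origin of Stata's date scale.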

replace

• replace: Writes over a variable with a new value

.replace F3ERN2011=.

.replace F3ERN2011=. if F3ERN2011>=200000

.replace F3ERN2011 = . in 20

Using conditional operators

• Create a dummy variable using if:

.tab BYRACE

.gen hispanic_dum = .

.replace hispanic_dum=0 if BYRACE==1 | BYRACE==2 | BYRACE==3 | BYRACE==6 | BYRACE==7

.replace hispanic_dum = 1 if BYRACE==4 | BYRACE==5

.label var hispanic_dum "Respondent is Hispanic"

.tab hispanic_dum BYRACE
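As an alternative to the chain of | conditions (a different technique from the one on the slide), the inlist() function can build the same dummy in one line. The BYRACE values below are invented so the sketch runs without the ELS file.

```stata
clear
input BYRACE
1
4
5
7
.
end

* 1 if BYRACE is 4 or 5, 0 for any other non-missing code, missing otherwise
generate hispanic_dum = inlist(BYRACE, 4, 5) if !missing(BYRACE)
label variable hispanic_dum "Respondent is Hispanic"

tabulate hispanic_dum BYRACE, missing
```

The if !missing(BYRACE) qualifier matters: without it, observations with a missing race code would be set to 0 rather than left missing.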

Using conditional operators

Create a dummy variable using cond:

.sum F3ERN2011, detail

.gen rich_dum = cond(F3ERN2011>=52500 & F3ERN2011<.,1,0)

But is this what we really wanted?

.tab rich_dum if F3ERN2011==.

Handy Trick: Missing Values

Stata treats missing numeric values as infinitely large for the purposes of transformations involving inequalities

Therefore, if we do this:

.gen rich_dum = cond(F3ERN2011>=52500 & F3ERN2011<.,1,0)

Any observations with missing values will be coded as a 0. So get in the habit of doing this instead:

.gen rich_dum = cond(F3ERN2011>=52500 & F3ERN2011<.,1,0) if F3ERN2011!=.
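The three possible outcomes for a missing value can be compared side by side on a tiny invented dataset (earnings and the rich_* names are hypothetical stand-ins for F3ERN2011 and rich_dum).

```stata
clear
input earnings
30000
60000
.
end

* No guard at all: missing counts as "infinitely large", so it is coded 1
generate rich_a = cond(earnings >= 52500, 1, 0)

* Guard inside cond(): missing fails the test, so it is coded 0
generate rich_b = cond(earnings >= 52500 & earnings < ., 1, 0)

* Guard in the if qualifier: missing stays missing
generate rich_c = cond(earnings >= 52500, 1, 0) if earnings < .

list
```

Only the third version keeps "we do not know this person's earnings" distinct from "this person is not rich".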

egen

• egen: Extensions to generate (pre-canned generate functions for common data transformations)

• Standardize wage using egen:

.egen z_earn2 = std(F3ERN2011)

.summarize z_earn2

• Generate group-level means using egen:

.by BYSEX BYRACE, sort: egen mean_earn = mean(F3ERN2011)

.summarize F3ERN2011 if BYSEX==1 & BYRACE==7
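Both egen uses can be tried on the auto example dataset shipped with Stata. In this sketch bysort is simply shorthand for the by ..., sort: prefix, and the new variable names are hypothetical.

```stata
sysuse auto, clear

* Standardize in one step
egen z_price = std(price)
summarize z_price

* Group-level mean attached to every observation in the group
bysort foreign: egen mean_price = mean(price)

* Check: within each group, mean_price should equal the group mean of price
tabulate foreign, summarize(price)
list foreign price mean_price in 1/5
```

Every car in a group carries the same mean_price, which makes it easy to build deviations from the group mean afterwards.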

Handy Tip: egenmore

The egenmore user-written ado-file (ssc install egenmore) extends the usefulness of egen by adding an extra 30ish pre-canned functions that can be called through egen. Some (randomly chosen) examples:

• rowmedian calculates the median across observations of the variables specified

• noccur calculates the number of times a substring appears in a string variable

• xtile categorizes variables by percentiles or quantiles

Factor Notation and Regressions

• Stata recently (version 11, I think) introduced factor-variable notation for estimating models:

» The i. prefix designates that you want Stata to treat the variable as a set of indicator (dummy) variables, so you don't have to create them yourself! So:

.regress F3ERN2011 BYSEX BYRACE

treats gender and race as continuous variables, while

.regress F3ERN2011 i.BYSEX i.BYRACE

treats both variables as categorical, estimating a unique intercept for each category

» The c. prefix designates variables that should be treated as continuous (the default assumption if i. is not specified).

Factor Notation

• You can do some clever things:

ib#. allows you to set the omitted (base) category

.regress F3ERN2011 BYSEX ib7.BYRACE

ib(freq). selects the modal category as base (default is smallest category)

i#. creates a binary dummy from your categorical variable:

.regress F3ERN2011 BYSEX i7.BYRACE

Thereby comparing whites to non-whites (the omitted base category)
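The base-category tricks can be compared on the auto example dataset shipped with Stata, using rep78 (a 1 to 5 repair rating with a few missing values) in place of BYRACE.

```stata
sysuse auto, clear

regress price i.rep78            // base = lowest level (rep78 == 1)
regress price ib3.rep78          // base = rep78 == 3
regress price ib(freq).rep78     // base = most frequent level
regress price i5.rep78           // one indicator: rep78 == 5 vs. all other levels
```

All four models use the same 69 observations with a non-missing rating; only the reference group, and therefore the meaning of each coefficient, changes.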

Factor Notation

This is also quite handy for incorporating higher order terms and interactions in your model:

.regress F3ERN2011 i.BYSEX i.BYRACE c.BYTXRSTD##c.BYTXRSTD

This estimates a model predicting earnings including gender and race as categorical variables. It also includes a standardized test score (BYTXRSTD here; the math score in the later examples is BYTXMSTD) and its squared term.

.regress F3ERN2011 i.BYSEX##c.BYTXMSTD i.BYRACE

This estimates a model predicting earnings including gender, race, and standardized math score. The interaction of gender and math score is also included, allowing the impact of math score on earnings to differ by gender.

Factor Notation

• Another benefit of using factor notation is that Stata is aware that quadratics are products of lower order terms and interaction terms are products of other variables. This makes calculating predicted values (and other interesting post-estimation statistics) trivial.

• Let's calculate the predicted differences in the earnings of a white, high scoring male vs. a white, high scoring female, using the old way and using factor notation

The Old Way

First, we create the interaction term:

.gen genderXmath = BYSEX*BYTXMSTD

Then, we include it in the regression:

.regress F3ERN2011 BYSEX BYTXMSTD genderXmath i.BYRACE

Now, we calculate our fitted value for women:

.margins, at(BYSEX=2 BYTXMSTD=63.14 BYRACE=7)

We get a fitted value of $20,341.63. And men:

.margins, at(BYSEX=1 BYTXMSTD=63.14 BYRACE=7)

We get a fitted value of $37,811.81.

These results say that men earn $17,470.18 more.

The Factor Notation Way

Run the regression:

.regress F3ERN2011 i.BYSEX##c.BYTXMSTD i.BYRACE

Now, we calculate our fitted value for women:

.margins, at(BYSEX=2 BYTXMSTD=63.14 BYRACE=7)

We get a fitted value of $30,880.54. And for men:

.margins, at(BYSEX=1 BYTXMSTD=63.14 BYRACE=7)

We get a fitted value of $34,498.93.

These results say that men earn $3,618.39 more, on average.

What accounts for this difference?

Remember your Calculus?

• The partial effect of gender on earnings in our equation isn't just the coefficient on gender. It is the coefficient on gender plus the product of the coefficient on the interaction term and the value we specify for math test score.

• Doing it the old way, Stata didn't know that the interaction term is the product of the two first-order terms. It just treated it like any other variable and held it constant when calculating the predicted values.

• Doing it the new way, Stata knew not to hold the interaction term constant, yielding the correct fitted values.
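The same contrast can be reproduced on the auto example dataset shipped with Stata, with foreign standing in for gender and weight for the math score (forXwt, the matrix names, and the scalar names are hypothetical). In a model with an interaction, the gap between the two groups at weight = 3000 should be b_foreign + b_interaction × 3000; the sketch checks which approach delivers that.

```stata
sysuse auto, clear

* Old way: hand-made interaction term
generate forXwt = foreign * weight
regress price foreign weight forXwt
margins, at(foreign=(0 1) weight=3000)   // forXwt is NOT updated with foreign
matrix m_old = r(b)
scalar gap_old = m_old[1,2] - m_old[1,1]
display "old-way gap:          " gap_old
display "coefficient on foreign: " _b[foreign]

* Factor-notation way
regress price i.foreign##c.weight
margins, at(foreign=(0 1) weight=3000)
matrix m_new = r(b)
scalar gap_new = m_new[1,2] - m_new[1,1]
display "factor-notation gap:  " gap_new
display "b_foreign + b_interaction*3000: " ///
    _b[1.foreign] + _b[1.foreign#c.weight]*3000
```

The old-way gap is just the coefficient on foreign, because margins averages forXwt over its observed values in both scenarios; the factor-notation gap matches the full partial effect.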