Statistics: Analysis of Diabetes


Keisteria Battle


The purpose of this analysis is to describe, analyze, and communicate the diabetes data set obtained from the Vanderbilt Department of Biostatistics (http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets). The analysis has four aims: 1) perform an exploratory data analysis and investigate the distributions of the variables; 2) investigate trends among people with diabetes and show which people are more susceptible to getting diabetes based on their health, frame, body type, and age; 3) evaluate any relationship between frame and gender; and 4) assess any relationship among cholesterol, gender, and weight. The analysis was performed in SAS 9.3, with graphics generated in SAS and modified in Microsoft Word.

The diabetes data set consists of nineteen variables and four hundred and three observations, but due to space limitations the data was cut down to five variables: cholesterol level, age, weight, gender, and frame. In the reformatted data, age and gender have four hundred and three observations, cholesterol and weight have four hundred and two observations each, and frame has twelve missing values (three hundred and ninety-one observations; see Table 8). An abbreviated display of the diabetes data is shown in Table 1, and the abbreviated variable names used in Table 1 are explained in Table 2. The variables are further described in Table 3, which gives their variable types and, where applicable, their measurement units. The Diabetes data set is used here to view the characteristics and traits of people with diabetes and to show who is more susceptible to getting diabetes.
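The variable count can be checked against the DROP= list in Appendix II. The short Python sketch below is only an illustration and is not part of the SAS analysis; the dropped names are copied from the appendix, and the helper name `remaining_count` is hypothetical.

```python
# Variables removed by the DROP= list in Appendix II (names copied from there).
DROPPED = [
    "id", "stab_glu", "hdl", "ratio", "glyhb", "location", "height",
    "bp_1s", "bp_1d", "bp_2s", "bp_2d", "waist", "hip", "time_ppn",
]

TOTAL_VARIABLES = 19  # number of variables in the full data set, as stated in the text


def remaining_count(total, dropped):
    """Return how many variables are left after dropping distinct names."""
    return total - len(set(dropped))


print(len(DROPPED), "dropped,", remaining_count(TOTAL_VARIABLES, DROPPED), "kept")
```

Fourteen names are dropped, which leaves the five variables analyzed in this report.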

In the reformatted diabetes data, all quantitative and categorical variables were examined. Categorical variables were examined first, through frequency tables, pie charts, and bar charts. Tables 4 and 5 display frequency tables for the categorical variables. Table 4 shows that the largest group of people with diabetes has a medium frame (N=184 or 47.06%). According to Table 5, most of the people with diabetes are female (N=234 or 58.06%). The distributions of these variables are shown graphically in Figures 1-4 through bar and pie charts. The pie charts are the easiest of these to read because each category's share of the whole is visible at a glance.
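The percent and cumulative columns of Tables 4 and 5 follow directly from the frequencies. The Python sketch below is an illustration outside the SAS workflow; the counts are taken from the two tables, and the function name `frequency_table` is an assumption made for this example.

```python
from itertools import accumulate


def frequency_table(counts):
    """Build (level, frequency, percent, cumulative frequency, cumulative percent) rows."""
    total = sum(counts.values())
    running = accumulate(counts.values())
    rows = []
    for (level, n), cum in zip(counts.items(), running):
        rows.append((level, n, round(100 * n / total, 2), cum, round(100 * cum / total, 2)))
    return rows


FRAME = {"Large": 103, "Medium": 184, "Small": 104}   # Table 4 frequencies
GENDER = {"Female": 234, "Male": 169}                 # Table 5 frequencies

for row in frequency_table(FRAME) + frequency_table(GENDER):
    print(row)
```

The frame percentages are based on 391 non-missing values, while the gender percentages use all 403 observations.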

Table 6 displays descriptive statistics for all quantitative variables: number of observations, mean, median, standard deviation, interquartile range, minimum, maximum, lower quartile, upper quartile, and range. Figures 5 – 10 display histograms and boxplots for each of these variables.

Figure 5 shows that the distribution of cholesterol among individuals with diabetes is right skewed. The mode is 200 milligrams per deciliter. The minimum value is 78 and the maximum is 443 milligrams per deciliter. There is a gap just before 400 milligrams per deciliter. The values are concentrated between 179 and 230 milligrams per deciliter (the lower and upper quartiles). The mean is 207.85 and the median is 204 milligrams per deciliter.

Figure 6 shows that the distribution of age among individuals with diabetes is right skewed. The mode is 40 years. The minimum value is 19 and the maximum is 92 years. There are no gaps. The values are concentrated between 34 and 60 years of age (the lower and upper quartiles). The mean is 46.85 and the median is 45 years of age.

Figure 7 shows that the distribution of weight among individuals with diabetes is right skewed. The mode is 180 pounds. The minimum value is 99 and the maximum is 325 pounds. There are no gaps. The values are concentrated between 151 and 200 pounds (the lower and upper quartiles in Table 6). The mean is 177.59 and the median is 172.50 pounds.

Figure 8 displays a boxplot of the cholesterol level, in milligrams per deciliter, of people with diabetes and shows 9 outliers. Figure 9 is a boxplot of the age, in years, of people with diabetes. Figure 10 is a boxplot of the weight, in pounds, of people with diabetes and shows ten outliers at the high end.
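The spread measures in Table 6 can be cross-checked from the quartiles and extremes. The Python sketch below is only an illustration; the numbers are copied from Table 6, the function name `spread` is hypothetical, and the outlier fences assume the usual 1.5 × IQR boxplot rule, which the report does not state explicitly.

```python
# Summary values copied from Table 6.
TABLE6 = {
    "Cholesterol": {"mean": 207.85, "median": 204.0, "min": 78.0, "q1": 179.0, "q3": 230.0, "max": 443.0},
    "Age": {"mean": 46.85, "median": 45.0, "min": 19.0, "q1": 34.0, "q3": 60.0, "max": 92.0},
    "Weight": {"mean": 177.59, "median": 172.5, "min": 99.0, "q1": 151.0, "q3": 200.0, "max": 325.0},
}


def spread(stats):
    """Return the interquartile range, the range, and 1.5 x IQR outlier fences."""
    iqr = stats["q3"] - stats["q1"]
    return {
        "iqr": iqr,
        "range": stats["max"] - stats["min"],
        "lower_fence": stats["q1"] - 1.5 * iqr,  # assumed 1.5 x IQR rule
        "upper_fence": stats["q3"] + 1.5 * iqr,
    }


for name, stats in TABLE6.items():
    # A mean above the median is consistent with a right-skewed distribution.
    print(name, spread(stats), "mean > median:", stats["mean"] > stats["median"])
```

Under that assumption, the minimum and maximum ages fall inside the fences, while the maximum cholesterol and weight values fall above the upper fence, which matches the outliers reported for Figures 8 and 10.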

The analysis continued by forming a new categorical variable, Body, from the quantitative variable weight. This new variable classifies individuals as underweight (below 174.33 pounds), average (174.33 to below 249.66 pounds), or obese (249.66 pounds or more), using the cutoffs in Appendix II. According to Figure 11, underweight individuals were more susceptible to diabetes. Table 7 shows that two hundred and ten of the individuals were underweight (52.11%), one hundred and sixty-six were average (41.19%), and obese individuals made up the smallest group, with twenty-seven (6.70%) of all individuals with diabetes.
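The Body classification can be sketched in a few lines of Python, shown here only as an illustration of the cutoffs used in Appendix II. The function name `body_category` is hypothetical. The cutoffs appear to split the observed weight range (99 to 325 pounds) into three equal-width bands; that reading is an assumption, since the report does not say how they were chosen. Unlike the SAS step, this sketch returns `None` for a missing weight.

```python
WEIGHT_MIN = 99.0     # minimum weight from Table 6
WEIGHT_RANGE = 226.0  # range of weight from Table 6

# Assumed origin of the cutoffs: thirds of the observed range.
CUT_LOW = WEIGHT_MIN + WEIGHT_RANGE / 3        # about 174.33
CUT_HIGH = WEIGHT_MIN + 2 * WEIGHT_RANGE / 3   # about 249.67


def body_category(weight, low=174.33, high=249.66):
    """Classify a weight in pounds using the cutoffs coded in Appendix II."""
    if weight is None:
        return None  # assumption: leave missing weights unclassified
    if weight < low:
        return "UNDER WEIGHT"
    if weight < high:
        return "AVERAGE"
    return "OBESE"


for w in (99, 174.33, 200, 249.66, 325):
    print(w, body_category(w))
```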

Next, a multivariate analysis was performed to investigate a possible relationship between frame and gender among individuals with diabetes. We hypothesized that females with larger frames would be more susceptible, on the assumption that males tend to be healthier and have better body frames. Table 8 displays a 2-way contingency table with the explanatory variable, gender, in rows and the response variable, frame, in columns. Because gender is the explanatory variable, the row percentages are the most informative. Table 8 shows that 51.10% of females have medium frames, compared with 41.46% of males. Large frames account for 37.20% of males but only 18.50% of females, and small frames account for 21.34% of males and 30.40% of females. These results are depicted in a 100% stacked bar chart in Figure 12. Together, Table 8 and Figure 12 suggest that there is a relationship between frame and gender, but that females with medium frames, not large frames, are more susceptible to getting diabetes. This opposes the research hypothesis that females with larger frames are more susceptible.
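The row percentages quoted from Table 8 can be reproduced from the cell frequencies. The Python sketch below is an illustration only; the counts come from Table 8 and the function name `row_percentages` is an assumption for this example.

```python
# Cell frequencies from Table 8 (12 observations with missing frame are excluded).
TABLE8 = {
    "Female": {"Large": 42, "Medium": 116, "Small": 69},
    "Male": {"Large": 61, "Medium": 68, "Small": 35},
}


def row_percentages(table):
    """Return each cell as a percentage of its row total, rounded to 2 decimals."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: round(100 * n / row_total, 2) for col, n in cells.items()}
    return result


for gender, shares in row_percentages(TABLE8).items():
    print(gender, shares)
```

Row percentages condition on gender, so each row sums to 100% and the two genders can be compared directly even though the sample contains more females than males.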

Another multivariate analysis was performed to assess a possible relationship between gender and weight. We hypothesized that males with large weights and females with large weights are more susceptible to getting diabetes. A grouped (side-by-side) histogram and side-by-side boxplots were created to show the distribution of weight by gender for individuals with diabetes. Figures 13 and 14 show that these distributions are unimodal and right skewed. Next, 95% confidence intervals for the mean weight by gender were generated from a simple random sample of 45 individuals (36 females and 9 males); Table 9 displays these intervals. We can be 95% confident that the mean weight for males is between 155.34 and 243.77 pounds and that the mean weight for females is between 161.38 and 199.51 pounds. The two intervals overlap, and the male interval is much wider because only 9 males were sampled, so any difference between the genders should be interpreted with caution. Both intervals lie below the obese cutoff of 249.66 pounds used for the Body variable, so, based on this data, males and females of average weight rather than large weight appear more susceptible. The side-by-side boxplots in Figure 14 provide further evidence of which weight and gender are more susceptible to getting diabetes. All of these results oppose the research hypothesis that males with large weights and females with large weights are more susceptible to getting diabetes.
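The intervals in Table 9 can be examined with a small Python sketch, given here as an illustration only. It recovers each interval's midpoint and margin of error from the reported limits and checks whether the two intervals overlap; the function names are hypothetical, and the limits are copied from Table 9.

```python
# 95% confidence limits for mean weight, copied from Table 9.
TABLE9 = {
    "Female": (161.38, 199.51),
    "Male": (155.34, 243.77),
}


def midpoint_and_margin(lower, upper):
    """Return the center of an interval and its half-width (margin of error)."""
    return (lower + upper) / 2, (upper - lower) / 2


def intervals_overlap(a, b):
    """Return True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


for gender, limits in TABLE9.items():
    mid, margin = midpoint_and_margin(*limits)
    print(gender, "midpoint:", round(mid, 3), "margin:", round(margin, 3))

print("Intervals overlap:", intervals_overlap(TABLE9["Female"], TABLE9["Male"]))
```

The midpoints agree with the sample means reported in Table 9 to within rounding, and the margin for males is more than twice the margin for females.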

A final multivariate analysis was performed to investigate a possible relationship between weight and cholesterol. We hypothesized that individuals with larger weights and low cholesterol levels are more likely to get diabetes. Figure 15 presents a scatterplot with the explanatory variable, cholesterol, on the x-axis and the response variable, weight, on the y-axis. The scatterplot shows that most of the data is clustered between 150 and 300 milligrams per deciliter and between 100 and 200 pounds, rather than at large weights and low cholesterol levels. Therefore, these results do not support the research hypothesis.
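A scatterplot can be complemented by a correlation coefficient, which the report does not compute. The Python sketch below shows one way to calculate Pearson's r; it is an illustration only, the function name `pearson_r` is an assumption, and the five data pairs are hypothetical values invented for the example, not observations from the diabetes data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical (cholesterol, weight) pairs, for illustration only.
cholesterol = [150, 180, 200, 220, 260]
weight = [160, 140, 185, 150, 175]
print("r =", pearson_r(cholesterol, weight))
```

A value of r near zero would be consistent with a scatterplot that shows a cluster without a clear linear trend.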

Appendix I: SAS Tables and Figures


Table 4: Frequency Table For Frame

Frame    Frequency  Percent  Cumulative Frequency  Cumulative Percent
Large    103        26.34%   103                   26.34%
Medium   184        47.06%   287                   73.40%
Small    104        26.60%   391                   100.00%

Table 5: Frequency Table For Gender

Gender   Frequency  Percent  Cumulative Frequency  Cumulative Percent
Female   234        58.06%   234                   58.06%
Male     169        41.94%   403                   100.00%


Figure 3: Pie Chart for Frame (n=403), showing percent of Frame

Figure 4: Pie Chart for Gender (n=403), showing percent of Gender


Table 6: Descriptive Statistics of All Quantitative Variables

Variable     N    Mean    Median  Standard Deviation  Quartile Range  Minimum  Maximum  Lower Quartile  Upper Quartile  Range
Cholesterol  402  207.85  204.00  44.45               51.00           78.00    443.00   179.00          230.00          365.00
Age          403  46.85   45.00   16.31               26.00           19.00    92.00    34.00           60.00           73.00
Weight       402  177.59  172.50  40.34               49.00           99.00    325.00   151.00          200.00          226.00


Table 7: Frequency Table For Body

BODY          Frequency  Percent  Cumulative Frequency  Cumulative Percent
AVERAGE       166        41.19%   166                   41.19%
OBESE         27         6.70%    193                   47.89%
UNDER WEIGHT  210        52.11%   403                   100.00%

Table 8: Contingency Table For Frame by Gender

Gender  Statistic    Large   Medium  Small   Total
Female  Frequency    42      116     69      227
        Percent      10.74   29.67   17.65   58.06
        Row Percent  18.50   51.10   30.40
        Col Percent  40.78   63.04   66.35
Male    Frequency    61      68      35      164
        Percent      15.60   17.39   8.95    41.94
        Row Percent  37.20   41.46   21.34
        Col Percent  59.22   36.96   33.65
Total   Frequency    103     184     104     391
        Percent      26.34   47.06   26.60   100.00

Frequency Missing = 12


Figure 12: 100% Stacked Bar Chart of Frame by Gender (n=403). Vertical axis: PERCENT. Legend: Frame (Large, Medium, Small).


Table 9: 95% Confidence Interval for Weight by Gender

Gender  N Obs  N   Mean    Median  Lower 95% CL for Mean  Upper 95% CL for Mean
Female  36     36  180.44  170.50  161.38                 199.51
Male    9      9   199.56  182.00  155.34                 243.77


In conclusion, an exploratory data analysis was executed and showed that the quantitative variables in the diabetes data set were right skewed, so the median was the best measure of central tendency for these variables. We then investigated a possible relationship between frame and gender and found evidence that opposed our hypothesis that females with larger frames are more susceptible to getting diabetes. We also compared weight by gender and found evidence that opposed our hypothesis that males with large weights and females with large weights are more susceptible to getting diabetes. The evidence shows that males and females with average bodies were more susceptible to getting diabetes. Finally, we investigated a possible relationship between weight and cholesterol. We hypothesized that individuals with larger weights and low cholesterol levels are more likely to get diabetes, and the results did not support this research hypothesis. For future research, we suggest a formal statistical analysis of weight by cholesterol while controlling for age and gender. This type of analysis might show a relationship that was not shown in this investigation. This has been a preliminary analysis of the Diabetes data.

Appendix II: SAS Code

*Battle, STAT 3010, FINAL PROJECT: ANALYSIS OF diabetes DATA;
LIBNAME W1 '\\Client\E$\Final project\diabetes.xls';

*IMPORTING Diabetes DATA;
DM 'LOG;CLEAR;OUT;CLEAR;';
OPTIONS LS=90 PS=66 FORMDLIM="=";
QUIT;

*importing data.:;
PROC IMPORT OUT= WORK.diabetes
    DATAFILE= "\\Client\E$\Final project\turn in final project\diabetes.xls"
    DBMS=EXCEL REPLACE;
  RANGE="diabetes$";
  GETNAMES=YES;
  MIXED=NO;
  SCANTEXT=YES;
  USEDATE=YES;
  SCANTIME=YES;
RUN;

*view data;
Proc Print data = diabetes;
Run;

*dropping unused variables;
proc sql;
  create table Kb as
  select *
  from diabetes(drop=id stab_glu hdl ratio glyhb location height bp_1s bp_1d bp_2s bp_2d waist hip time_ppn);
quit;

*reformatting Kb * viewing the new table that has 5 variables.;
Proc Print data = Kb;
Run;

*SAVE THE NEW DIABETES AS A PERMANENT DATASET;
*ESTABLISHING LIBREF;
*NOTE: a libref normally points to a folder, and this path names a .sas file;
LIBNAME W1 '\\Client\E$\Final project\Battle_Keisteria_final Project.sas';
*Write WORK.Kb to the permanent library, as the comment above intends;
DATA W1.Kb;
  SET WORK.Kb;
RUN;

*ods of diabetes first 7 and last 3 rows ; *question 5:;
ODS RTF;


Data First;
  set diabetes (obs=7);
Run;

*Firstobs=401 keeps rows 401 to 403, the last 3 of 403 rows;
Data Last;
  set diabetes (Firstobs=401);
Run;

*Combines the two data tables made into 1 complete table;
Data Table;
  set first last;
run;

*Print the combined table so the open RTF destination receives output;
Proc Print data=Table;
Run;
ODS RTF Close;

*Frequency tables for frame variables;
ODS RTF;
Proc Freq data=Kb;
  Tables frame;
  title 'Table 4: Frequency Table For Frame';
Run;
ODS RTF CLOSE;

*Frequency tables for gender variables;
ODS RTF;
Proc Freq data=Kb;
  Tables gender;
  title 'Table 5: Frequency Table For Gender';
Run;
ODS RTF CLOSE;

*Bar Chart;
ODS RTF;
Proc SGPLOT Data=Kb;
  TITLE 'Figure 1: Bar Chart for Frame (n=403)';
  VBAR frame;
Run; Quit;
ODS RTF CLOSE;

ODS RTF;
Proc SGPLOT Data=Kb;
  TITLE 'Figure 2: Bar Chart for Gender (n=403)';
  VBAR gender;


Run; Quit;
ODS RTF CLOSE;

***Pie chart;
*for frame;
ODS RTF;
Proc GCHART Data=Kb;
  TITLE 'Figure 3: Pie Chart for Frame (n=403)';
  PIE Frame / TYPE=PCT LEGEND;
Run; Quit;
ODS RTF CLOSE;

*for gender;
ODS RTF;
Proc GCHART Data=Kb;
  TITLE 'Figure 4: Pie Chart for Gender (n=403)';
  PIE Gender / TYPE=PCT LEGEND;
Run; Quit;
ODS RTF CLOSE;

*Descriptive Statistics for all of the Quantitative Variables;
ods rtf;
Proc Means data=Kb MAXDEC=2 N MEAN MEDIAN STD QRANGE MIN MAX Q1 Q3 RANGE;
  title 'Table 6: Descriptive Statistics of All Quantitative Variables';
Run;
ODS RTF Close;

*Histograms;
*Just a histogram for Just Cholesterol;
ODS RTF;
PROC sgplot DATA = Kb;
  TITLE 'Figure 5: Histogram of Cholesterol (n=402)';
  xaxis label= "Cholesterol (in Milligram per Deciliter)";
  HISTOGRAM Cholesterol;
RUN;
ODS RTF Close;

*Just a histogram for Just Age;
ODS RTF;
PROC sgplot DATA = Kb;
  TITLE 'Figure 6: Histogram of Age (n=403)';
  xaxis label= "Age (in Years)";
  HISTOGRAM Age;
RUN;
ODS RTF Close;

*Just a histogram for Just Weight;
ODS RTF;


PROC sgplot DATA = Kb;
  TITLE 'Figure 7: Histogram of Weight (n=402)';
  xaxis label= "Weight (in Pounds)";
  HISTOGRAM Weight;
RUN;
ODS RTF Close;

*BOXPLOTS;
*boxplot for cholesterol;
ODS RTF;
PROC SGPLOT DATA = Kb;
  TITLE 'Figure 8: Boxplot of Cholesterol (n=402)';
  VBOX Cholesterol;
  YAXIS LABEL = 'Cholesterol (in Milligram per deciliter)';
RUN;
ODS RTF Close;

*boxplot for age;
ODS RTF;
PROC SGPLOT DATA = Kb;
  TITLE 'Figure 9: Boxplot of Age (n=403)';
  VBOX age;
  YAXIS LABEL = 'Age (in Years)';
RUN;
ODS RTF Close;

*boxplot for weight;
ODS RTF;
PROC SGPLOT DATA = Kb;
  TITLE 'Figure 10: Boxplot of Weight (n=402)';
  VBOX weight;
  YAXIS LABEL = 'Weight (in Pounds)';
RUN;
ODS RTF Close;

/* Create a new categorical variable for weight */
/* A missing weight compares as less than any number in SAS, so it is labelled UNDER WEIGHT */
DATA Kb;
  SET Work.Kb;
  IF weight < 174.33 THEN BODY='UNDER WEIGHT';
  ELSE IF weight < 249.66 THEN BODY='AVERAGE';
  ELSE BODY='OBESE';
RUN;

* Print the data file KB;
PROC PRINT DATA=KB;
  TITLE 'WEIGHT DATA';
RUN;

*frequency table for new variable(body);


ODS RTF;
Proc Freq data=Kb;
  Tables body;
  title 'Table 7: Frequency Table For Body';
Run;
ODS RTF CLOSE;

*bar chart for new variable(body);
ODS RTF;
Proc SGPLOT Data=Kb;
  TITLE 'Figure 11: Bar Chart for Body (n=403)';
  VBAR body;
Run; Quit;
ODS RTF CLOSE;

*********multivariate**************analysis on frame by gender;

*Contingency Tables for Frame by Gender;
ODS RTF;
Proc Freq DATA = Kb;
  Tables Gender*frame;
  TITLE 'Table 8: Contingency Table For Frame by Gender';
Run;
ODS RTF CLOSE;

*100% STACKED BAR CHARTS;
ODS RTF;
Proc GCHART Data=Kb;
  TITLE 'Figure 12 : 100% Stacked Bar Chart of Frame by Gender (n=403)';
  VBAR Gender / SUBGROUP=frame TYPE=PCT GROUP=Gender NOZERO G100 GAXIS=AXIS1;
Run; Quit;
ODS RTF Close;
DM 'LOG;CLEAR;OUT;CLEAR;';

*side-by-side histogram;
Proc Sort Data=Kb;
  BY Gender;
Run;

*CREATE THE HISTOGRAM;
ODS RTF;
TITLE 'Figure 13:Grouped Histogram of Gender by Weight (n=403)';
PROC SGPLOT Data=Kb;
  Histogram weight;


  BY Gender;
  XAXIS LABEL = "Weight (in Pounds)";
Run;
ODS RTF Close;

*SIDE-BY-SIDE BOXPLOTS Weight by gender;
ODS RTF;
Proc Sort Data=Kb;
  BY gender;
Run;

PROC SGPLOT DATA = Kb;
  TITLE 'Figure 14: Side-by-Side Boxplot of Weight by Gender (n=403)';
  VBOX weight / CATEGORY=gender;
  YAXIS LABEL = 'Weight (in Pounds)' VALUESHINT MIN=90 MAX=330;
RUN;
ODS RTF Close;

*create simple random sample;
Proc Surveyselect data=Kb out=Srs Method=SRS Sampsize=45 Seed=24;
Run;

*view srs;
Proc Print data=Srs;
Run;

*Calculate Descriptive Statistics for the Quantitative Variables;
proc means data=Srs N Mean Median Min Max STD QRange maxdec=2;
Run;

proc univariate data=Srs;
  histogram;
Run;

*95% Confidence interval table;
Proc Sort Data=Srs;
  BY Gender;
Run;
ODS RTF;
proc means data=Srs N MEAN MEDIAN CLM alpha=0.05 maxdec=2;
  VAR Weight;
  class Gender;
  title 'Table 9: 95% Confidence Interval for Weight by Gender';
Run;
ODS RTF Close;

*scatterplot of weight by cholesterol;
ODS RTF;
Proc sgplot data=Kb;
  Title 'Figure 15: Scatterplot of Weight By Cholesterol (n=402)';


  scatter x = cholesterol y = weight;
  xaxis Label='Cholesterol (in Milligram per Deciliter)';
  yaxis Label='Weight (in Pounds)';
  REG x = cholesterol y = weight; *Adds the regression line;
Run;
ODS RTF Close;
