How To Separate Data In Stata Using A Dummy Variable

How-to Guide for Stata

Introduction

In this guide, you will learn how to estimate a multiple regression model with dummy variables in Stata using a practical example to illustrate the procedure. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have already opened the data file in Stata.

Contents

1. Multiple Regression With Dummy Variables
2. An Example in Stata: Weight and Being a Non-Smoker
- 2.1 The Stata Procedure
- 2.2 Exploring the Stata Output
3. Your Turn

1 Multiple Regression With Dummy Variables

Multiple regression expresses a dependent, or response, variable as a linear function of two or more independent variables. Readers looking for a general introduction to multiple regression should refer to the appropriate examples in SAGE Research Methods. This example focuses specifically on including dummy variables among the independent variables in a multiple regression model.

When one or more of the independent variables are categorical, the most common method of properly including them in the model is to code them as dummy variables. Dummy variables are dichotomous variables coded as 1 to indicate the presence of some attribute and as 0 to indicate the absence of that attribute.
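As a sketch of what a dummy variable does inside a regression equation, consider a simplified model with one continuous predictor and one dummy. The symbols y, x, d, and the betas are illustrative notation rather than anything taken from the Stata output, and the usual assumption that the error term has mean zero given x and d is made:

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \beta_2 d_i + \varepsilon_i, \qquad d_i \in \{0, 1\} \\
E[y_i \mid x_i,\, d_i = 0] &= \beta_0 + \beta_1 x_i \\
E[y_i \mid x_i,\, d_i = 1] &= (\beta_0 + \beta_2) + \beta_1 x_i
\end{align*}
```

Both groups share the slope on x. The dummy coefficient shifts the intercept, so it equals the expected difference in y between the two groups at any fixed value of x.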

2 An Example in Stata: Weight and Being a Non-Smoker

This example uses several variables from the 2012 General Social Survey:

The respondent's weight (rweight), measured in pounds (the dependent variable).
The respondent's height (rheight), measured in inches.
Whether the respondent is female (female), coded 1 = Yes and 0 = No.
The respondent's age (age), coded in years.
The respondent's age squared (age2), which is simply age in years squared.
The respondent's family income (income), coded into categories from 1 to 25.
Whether the respondent is a non-smoker (nosmoke), coded 1 = Yes and 0 = No.

The sample dataset includes 1,351 respondents. The average weight of survey respondents is just over 178 pounds, while the average height is about 67 inches. About 55% of the respondents are female, with an average age of about 50 years. The median income falls between $40,000 and $49,000 per year. Turning to the independent variable of interest, about 76% of respondents are non-smokers, leaving 24% who do smoke.

2.1 The Stata Procedure

When conducting a multiple regression, it is often wise to examine the dependent variable in isolation first. Summary statistics for all variables can be compiled using the summarize command, followed by a list of the variables of interest. In the case of the dependent variable, enter the following command in the Stata Command window:

summarize rweight

Press Enter to produce summary statistics detailing the number of observations, mean, standard deviation, and range.

Next, we present a histogram of weight. This is done in Stata by entering the following command in the Command window:

histogram rweight

Press Enter to produce a histogram. By default, Stata will produce a density histogram. To select frequency, enter the following command instead:

histogram rweight, frequency

Alternatively, you can create a histogram by selecting options from the menu as follows:

Graphics → Histogram

In the histogram dialog box that opens, you will see a textbox labeled "Variable" in the upper left-hand corner. Use the drop-down menu to select rweight from the list of variables. To the right of the "Variable" box, you will see two buttons asking you to specify whether data are discrete or continuous. Ensure that the "Data are continuous" option has been selected. In the lower right-hand corner under "Y axis," select "Frequency," and click OK to perform the analysis.

Screenshots for the procedure to produce histograms in Stata are available in the How-to Guides for the Dispersion of a Continuous Variable topic that is part of SAGE Research Methods Datasets.

We recommend exploring each independent variable as well, but we do not do so here in the interest of space.
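For readers who want to explore the independent variables themselves, the following is a minimal sketch using only the variable names listed earlier in this guide. It assumes the dataset is already open and is one reasonable way to proceed, not a required procedure:

```stata
* Summary statistics for the non-dummy independent variables
summarize rheight age age2 income

* Percentiles (including the median) for the categorical income measure
summarize income, detail

* Frequency tables for the two dummy variables (coded 0/1)
tabulate female
tabulate nosmoke

* The mean of a 0/1 dummy equals the proportion of cases coded 1
summarize female nosmoke
```

Because a dummy variable takes only the values 0 and 1, a frequency table is usually more informative for it than a histogram.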

You estimate a multiple regression model in Stata by entering the regress command in the Command window, followed by the dependent variable first (rweight) and the independent variables thereafter. The command is as follows:

regress rweight rheight female age age2 income nosmoke

Press Enter to run the analysis.

Entering the command as above into the Stata Command window is the simplest way to carry out this estimation. However, the multiple regression model can also be estimated by using the menu options as follows:

Statistics → Linear models and related → Linear regression

In the "regress - Linear regression" dialog box that opens, two text boxes are provided to specify the dependent and independent variables to include in the model. In the "Dependent variable" box, select rweight from the drop-down menu. In the "Independent variables" text box, select rheight, female, age, age2, income, and nosmoke.

Once you are done, click OK to perform the analysis.

Figure 1 shows what the dialog box looks like in Stata.

Figure 1: Selecting Multiple Regression From the Statistics Menu in Stata.

2.2 Exploring the Stata Output

We start by presenting the histogram for the weight variable in Figure 2.

Figure 2 shows that the majority of values for weight fall near the mean of 178. Very few respondents report weights below 100, but a substantial number of respondents report weights of 200–250 pounds. A handful of respondents report weights of 300 pounds or greater. The distribution is slightly skewed because the positive tail includes some larger values, but overall, the distribution is sufficiently close to normal so as not to raise any concerns.

Figure 2: Histogram Showing the Distribution of Respondent Weight Measured in Pounds, 2012 General Social Survey.

Next, we consider the results of the multiple regression. This procedure in Stata produces a table of results, which are presented in Figure 3.

Figure 3: Multiple Regression Model With Respondent Weight Regressed on a Range of Independent Variables, Including Whether the Respondent Is a Non-Smoker, 2012 General Social Survey.

The top section of the table provides an analysis of variance for the model as a whole. While these results are not the focus of this example, we note that the R-Squared figure reported to the upper right of the table measures the proportion of the variance in the dependent variable explained by the model. An R-Squared of .267 means that about 26.7% of the variance in weight is accounted for by the independent variables in this model.

The bottom section of the table presents the estimates of the intercept, or constant (_cons), and the slope coefficients. For this example, we focus on the results associated with being a non-smoker. The coefficient estimate is 12.979 and is statistically significant. This means that a one-unit increase in the non-smoker dummy variable is associated with an average increase in weight of about 13 pounds. A one-unit increase in the non-smoker dummy variable is the difference between being a smoker (nosmoke = 0) and a non-smoker (nosmoke = 1). In other words, after controlling for the effects of the other variables in the model, the average difference in weight between smokers and non-smokers is about 13 pounds.
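One way to see this interpretation directly in Stata is to compare average predicted weights with the dummy set to each of its two values. The sketch below assumes the model is re-estimated first so that its stored results are available. It uses the standard post-estimation tools display and margins and is offered as an illustration rather than as part of the original procedure:

```stata
* Re-estimate the model so its stored results are available
regress rweight rheight female age age2 income nosmoke

* Show the estimated coefficient on the non-smoker dummy
display _b[nosmoke]

* Average predicted weight with nosmoke set to 0 and then to 1,
* leaving all other variables at their observed values
margins, at(nosmoke=(0 1))
```

Because the model is linear with no interaction terms, the gap between the two predicted averages reported by margins should equal the nosmoke coefficient.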

The remaining results in the table conform to what we likely would have expected in terms of the coefficient estimates being positive or negative, and all of the estimated partial slope coefficients clearly reach conventional levels of statistical significance.

There are multiple diagnostic tests researchers might perform following the estimation of a regression model to evaluate whether the model appears to violate any of the OLS assumptions or whether there are other kinds of problems such as particularly influential cases. Describing all of these diagnostic tests is beyond the scope of this example.
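For orientation only, a few commonly used post-estimation checks can be run right after regress. This is a sketch, not a complete diagnostic workflow, and the variable name cooksd_tmp is a hypothetical name introduced here for illustration:

```stata
* Estimate the model first; the commands below use its stored results
regress rweight rheight female age age2 income nosmoke

* Residual-versus-fitted plot to look for non-constant variance
rvfplot

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest

* Variance inflation factors (age and age2 will be highly
* collinear by construction, so large values there are expected)
estat vif

* Cook's distance to flag potentially influential cases
predict cooksd_tmp, cooksd
summarize cooksd_tmp, detail
```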

3 Your Turn

Download this sample dataset to see whether you can replicate these results. Then, repeat the analysis, this time estimating the model separately for male and female respondents.
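As a hint for the second part of the exercise, one possible approach is to restrict the estimation sample with an if qualifier on the female dummy. Note the assumption made in this sketch: female is dropped from the list of independent variables, because it does not vary within either subsample:

```stata
* Female respondents only (female == 1)
regress rweight rheight age age2 income nosmoke if female == 1

* Male respondents only (female == 0)
regress rweight rheight age age2 income nosmoke if female == 0

* Equivalent shortcut: run the same model once per value of female
bysort female: regress rweight rheight age age2 income nosmoke
```

Comparing the nosmoke coefficient across the two sets of results shows whether the estimated weight difference between smokers and non-smokers differs for men and women.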