When I think back to what most improved my coding efficiency in Stata, I always remember my first forays into macros (globals and locals) and do loops (foreach and forvalues). I made plenty of mistakes at first and repeatedly had to go back and correct code that did not do exactly what I intended. I doubt this is an uncommon experience, and I believe these early difficulties too often deter people from pushing past the initial frustration toward confident, correct use of these very helpful Stata tools. This is why I incorporated some of the lessons from my own failures into a “Data Management in Stata” tutorial I taught as a PhD student at Rutgers-Newark (also housed on this website here – Stata Tutorial).

In that vein, I decided that my first blog post on this website should distill some of these lessons by applying macros and do loops to an issue most data analysts encounter early in their careers: having to run many models over more than one outcome.

I’ll begin by creating some simulated data in various forms (continuous, dichotomous, and count):

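* Start from an empty dataset (this drops any data currently in memory)
clear
* Optional: an arbitrary seed so the simulated draws are reproducible
set seed 12345
set obs 1000
* Ten continuous variables: normal with mean 100 and SD 25
forvalues i=1/10 {
    generate a`i'=rnormal(100,25)
}
* Five dichotomous variables: 1 if a uniform draw is at least .50
forvalues i=1/5 {
    generate b`i'=runiform()>=.50
}
* Three count variables: Poisson with mean 1
forvalues i=1/3 {
    generate c`i'=rpoisson(1)
}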

Here I have created 1000 observations using the “set obs” command. I then create three sets of variables using “forvalues” loops: 1) 10 normally distributed variables (the “rnormal” function) with mean 100 and standard deviation 25, 2) 5 Bernoulli-distributed variables (each equals 1 when a draw from the “runiform” function is at least .50, so 1s occur with probability .50), and 3) 3 Poisson-distributed variables (the “rpoisson” function with mean 1).
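A quick way to check that the simulated variables look as intended is to summarize them. This is only a minimal sketch, and it assumes the variables were created in the order shown above, so that ranges such as a1-a10 pick up exactly those variables. Because the data are random draws, the means should land near 100, .50, and 1 rather than exactly on them.

```stata
* Continuous variables: means should be close to 100, SDs close to 25
summarize a1-a10

* Dichotomous variables: means should be close to .50
summarize b1-b5

* Count variables: means should be close to 1
summarize c1-c3

* Frequency table for one of the dichotomous variables
tabulate b1
```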

The syntax of the “forvalues” loops is worth noting. Each loop starts with the “forvalues” command, followed by a stand-in name “i” (technically a local macro) and the range of values to loop over (“1/10”, or 1 through 10). That first line must end with an open squiggly brace (“{”). The code to be repeated goes on the lines that follow, and the loop is closed by a closing squiggly brace (“}”) on a line of its own. Within the loop, I reference the stand-in “i” by wrapping it in a grave accent and an apostrophe, like so: `i'. The loop first sets “i” equal to 1, runs all the code within the squiggly braces, then moves “i” to 2, and continues until it has run with the final specified value, 10.
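To watch the stand-in take on each value, it can help to run a loop that does nothing but print it. The sketch below is purely illustrative and creates no variables. The second loop uses a step range, “2(2)10”, which is another range form that forvalues accepts.

```stata
* Print the value of i on each pass: 1, then 2, then 3
forvalues i = 1/3 {
    display "i is now `i'"
}

* Step through even numbers only: 2, 4, 6, 8, 10
forvalues i = 2(2)10 {
    display "i is now `i'"
}
```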

Now, I want to create a few global macros to reference these variables in one place, so that I can invoke them later on. This procedure is particularly helpful when you have to run multiple regressions using the same set of independent variables, but over a set of dependent variables. For the purposes of this example, I will be using the variables a6, a7, a8, a9, and a10 as dependent variables.

global a a1 a2 a3 a4 a5
global b b1 b2 b3 b4 b5
global c c1 c2 c3
global depvar a6 a7 a8 a9 a10

The above lines of code will create four global macros named 1) a, 2) b, 3) c, and 4) depvar that can each be referenced later on by adding a “$” symbol in front of their names, as so:

regress a6 $a $b $c

Since the “regress” command in Stata allows only a single dependent variable, I include only “a6” in the above example (rather than the global macro I created above, $depvar). Specifying “$a $b $c”, however, includes every variable stored in those global macros. This means that the code directly above is equivalent to its long form:

regress a6 a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3
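If you are ever unsure what a global macro will expand to, you can ask Stata to show you before running a model. The lines below are a small sketch that assumes the four globals defined earlier (a, b, c, and depvar) are still in memory.

```stata
* Print the contents of individual globals
display "$a"
display "$depvar"

* List several globals and their contents at once
macro list a b c depvar
```

The first “display” line should print a1 a2 a3 a4 a5, which is exactly the text Stata substitutes wherever $a appears.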

The long form takes quite a bit more writing than the version with global macros. Imagine writing out this code in long form for many different models: not only would your code be quite long, but every additional model you estimate would be another opportunity for error. Further, if the models are written in long form, what happens when you decide to change the variables included in them? That’s right, every single line would need to be revised, injecting yet another source of error into your analyses.

To regress each of our dependent variables (a6, a7, a8, a9, and a10) on the same set of independent variables, we can use a foreach loop in conjunction with the global macro “depvar”:

foreach var in $depvar {
regress `var' $a $b $c
}

The “foreach” command is specified similarly to the “forvalues” command. I still must specify a stand-in name (“var”) that takes on each of a set of values in turn; here, the variable names I have stored in the global macro “depvar”. Because Stata expands $depvar before the loop runs, the loop above is equivalent to writing:

foreach var in a6 a7 a8 a9 a10 {
regress `var' $a $b $c
}
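One way to see exactly which commands a loop like this will run is to print each command instead of executing it. This is just a debugging sketch, assuming the globals a, b, c, and depvar are defined as above. Once the printed lines look right, swap “display” and the quotation marks back out for the real command.

```stata
* Print, rather than run, the regression command for each outcome
foreach var in $depvar {
    display "regress `var' $a $b $c"
}
```

The first line printed should match the long-form command shown earlier for a6, followed by the same line for a7 through a10.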

The benefit here is the same as the previous example using global macros to specify a set of independent variables – if you decide to edit your dependent variables you only need to do so in one place (where you initially define the macro), thereby decreasing the potential sources of error in your code.
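As a sketch of what a one-place edit looks like, suppose you decided to trim both the predictors and the outcomes. The particular variables dropped here are hypothetical choices made purely for illustration. Only the macro definitions change; the loop itself is untouched.

```stata
* Hypothetical revision: fewer continuous predictors and fewer outcomes
global a a1 a2 a3
global depvar a6 a7 a8

* The loop is written exactly as before and picks up the new definitions
foreach var in $depvar {
    regress `var' $a $b $c
}
```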

I should note that there are several ways of specifying “forvalues” and “foreach” loops in Stata, and that these are simply the ones I have come to prefer over time. What I find most helpful about this approach is that I define sets of independent and dependent variables early in my do files and, therefore, only need to alter the variables in one section, as opposed to many if I write out each regression in long form.
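For reference, here are a few other forms of the same loop that Stata accepts. These are offered as a sketch rather than a recommendation. The local macro name “outcomes” is made up for this example, and the variable-list form assumes a6 through a10 sit next to each other in the dataset, as they do in the simulated data above.

```stata
* Variant 1: name the global directly (no $ is needed after "of global")
foreach var of global depvar {
    regress `var' $a $b $c
}

* Variant 2: loop over a variable list; a6-a10 relies on dataset order
foreach var of varlist a6-a10 {
    regress `var' $a $b $c
}

* Variant 3: the same idea with a local macro instead of a global
local outcomes a6 a7 a8 a9 a10
foreach var of local outcomes {
    regress `var' $a $b $c
}
```

Keep in mind that a local macro only exists within the do file (or the selection of lines) in which it is defined, so the “local” line in the third variant must be run together with its loop.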

One last thing: this approach will obviously fall apart if the sets of independent variables differ across outcomes, but I have developed some other techniques to handle that issue (for a future blog post!).