research_methods_slides

slides for my research methods course at University of Waterloo

On GitHub: mclevey/research_methods_slides

Research Methods

Dr. John McLevey
University of Waterloo
john.mclevey@uwaterloo.ca
johnmclevey.com
networkslab.org/metaknowledge/
Winter 2016, Knowledge Integration, University of Waterloo

These are html slides! You can navigate them using the arrow keys on your computer or the arrows on the bottom right of the screen. The slides are nested. Scroll through the major sections of the course horizontally, and access content vertically. Get an overview of the slides by pressing ESC. Make the slides fullscreen by pressing F.

Design and Learning Objectives

Why are you here?

What makes this a good research methods course for a KI (or KI sympathetic) student?

  • an emphasis on the logic of multiple approaches
  • an emphasis on mixed methods
  • a practical orientation that prioritizes real world problems
  • hands on and project-driven
  • learn how to do reproducible and open science from the start
  • small and collaborative
  • a foundation for future learning and personal development

By the end of the class, students should be able to:

  • understand the logic of quantitative, experimental, relational, and qualitative research
  • formulate empirical research questions and testable hypotheses
  • explain the importance of probability samples in quantitative research
  • identify patterns in data
  • understand the assumptions of regression models
  • design high-quality survey questions and interview guides
  • do reproducible work
  • have a solid foundation for future learning about research methods

Deliverables

Everything except the TCPS2 training, the comprehension quizzes, and participation is collaborative.

Assignment              | Deadline               | %
TCPS2 Training          | January 11             | NA
3 article reviews       | Day before class       | 15
Quant data challenge    | February 10            | 15
Networks data challenge | March 7                | 15
Qual data challenge     | March 23               | 15
5 quizzes               | J20, F10, 22, M7, 23   | 10
Major research project  | End of term            | 25
Engagement              | Ongoing                | 5

Collaborative Work

You are permitted to: 

* work together on all assignments, challenges, and projects 
* divide labor such that some students code more than others
* re-use some code that was written by other people, including other students
But:

* everyone is responsible for learning the content, and will have to complete comprehension quizzes on their own
* everyone has to write at least *some* code
* everyone is responsible for understanding what their R scripts are doing

Software

R

  • is free / libre / open source
  • is a specialized programming language, but there is a fantastic IDE (RStudio) that will make your life easier
  • has state of the art libraries for anything you need to do
  • has much higher quality graphics than alternatives
  • has the best integration of writing tools, data, code, etc. by far
  • works on almost any machine, even machines running Windows!

There is a learning curve, but I honestly believe it is gentler than the one for SPSS and other proprietary software. Besides, learning curves are acceptable...

R Markdown (and Pandoc)

  • combines your computing and your written analysis in one text file so that you never have to copy and paste results
  • is easy to learn
  • provides very large payoffs for very little work
  • is faster and simpler than LaTeX
  • enables you to use a few simple tools that make the process of writing about quantitative research almost pleasurable
  • facilitates "thinking with data"
  • is revolutionizing the way that the best science is done. seriously.
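To make this concrete, here is a minimal sketch of what an R Markdown file looks like. The title and chunk contents are just placeholders, not part of any assignment:

---
title: "INTEG 275 Notes"
output: pdf_document
---

Notes are written in plain Markdown. Code chunks run when the file is knit,
and their results are inserted directly into the document:

```{r}
numbers <- c(1, 2, 3, 2, 4)
mean(numbers)
```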

Submitting Work

Never give me a Microsoft Word document. Instead, give me R Markdown files and PDFs in #slack and Learn.

Communication

We will use slack!

w2016-integ275.slack.com

sign-up here:

https://w2016-integ275.slack.com/signup

Studying

Advice

  • read, re-read, re-read
  • you are bright, but expect some things to be very challenging
  • take notes on core concepts and their relationships to one another
  • review your notes frequently
  • think about the concepts on and off throughout the day
  • don't separate the readings from the hands on learning

Taking Notes

  • take notes in RMarkdown files, not Microsoft Word documents
  • RStudio will provide an RMarkdown template for you to get started with
  • you can customize RMarkdown files endlessly, but the defaults are mostly fine
  • I will provide you with RMarkdown files most but not all of the time
  • don't keep everything on your Desktop. have an organized folder structure.

Thinking Quantitatively

Survey design and quality

You have ~ 5 minutes to complete the survey in front of you.

What's wrong with this survey?

Key Issues in Survey Design and Quality (Groves et al. Reading)

Two "inferential steps" in survey methodology

  • between the questions you ask and the thing you actually want to measure
  • between the sample of people you talk to and the larger population you care about

Errors are not mistakes, they are deviations / departures from desired outcomes or true values.

Errors of observation are when there are deviations between what you ask and what you actually want to measure.

Errors of non-observation are when the statistics you compute for your sample deviate from the population.

Surveys from the design perspective

You move from the abstract to the concrete when you design surveys. "Without a good design, good survey statistics rarely result." You need forethought, planning, and careful execution.

Next Class

Surveys from a Quality Perspective!

  • Finish reading Groves et al. pages 49-63
  • Browse the Statistics Canada website and come up with a short list of surveys of Canadians that you find personally interesting. What are they about? What types of questions do they ask?
  • Put your list in the #general channel on slack.

Reminder

There are two "inferential steps" in survey methodology

  • between the questions you ask and the thing you actually want to measure
  • between the sample of people you talk to and the larger population you care about

Errors of observation are when there are deviations between what you ask and what you actually want to measure.

Errors of non-observation are when the statistics you compute for your sample deviate from the population.

$\mu_i$ = value of a construct for the $i$th person in the population, $i$ = 1, 2, 3, 4 ... N

$Y_i$ = value of a measurement for the $i$th sample person

$y_i$ = value of the response to the application of the measurement

$y_{ip}$ = value of the response after editing and processing steps

We are trying to measure $\mu_i$ using $Y_i$, which will be imperfect due to measurement error. When we apply the measurement $Y_i$ (e.g. by asking a survey question), we actually obtain $y_i$, which can deviate further because of problems with administration. Finally, we try to mitigate these errors by making final edits, resulting in $y_{ip}$.

The measurement equals the true value plus some error term ($\epsilon_i$).

$Y_i = \mu_i + \epsilon_i$

The answers you provide on a survey are inherently variable. Given so many "trials," you might not provide the same answers. In theory, there could be an infinite number of trials! We can use another subscript $_t$ to denote the trial of the measurement. We will still use $_i$ to represent each element of the population (e.g. the person completing the survey).

$Y_{it} = \mu_{i} + \epsilon_{it}$
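One way to build intuition for the trial notation is a quick simulation in R. This is only an illustration; the true value and the error distribution below are made up:

set.seed(42)
mu_i <- 10                                 # person i's true value of the construct
y_it <- mu_i + rnorm(5, mean = 0, sd = 2)  # five hypothetical trials of the measurement
y_it                                       # the responses vary from trial to trial
mean(y_it)                                 # averaging over trials lands near the true value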

Q: Have you ever, even once, used any form of cocaine?

Survey respondents tend to under report behaviors that they perceive as undesirable. Even if the answer is yes, they may answer no.

What if the discrepancy between responses and the true value is systematic?

If response deviations are systematic, then we have response bias, which will cause us to under-estimate or over-estimate population parameters. If they are not systematic, we have response variance, which leads to instability in the value of estimates over trials.

Coverage Error

There are people in the population that are not in our sampling frame, and there are people in our sampling frame that are not in our population.

:(

Sampling Error

If there are some members of the sampling frame that are given no, or reduced, chance of inclusion, then we have sampling bias. They are systematically excluded. Sampling variance is not systematic and is due to random chance.

We will discuss this more next week, but the extent of error due to sampling is a function of 4 things:

  • whether all sampling frame elements have known non-zero chances of selection into the sample (probability sampling)
  • whether the sample is designed to control the representation of key sub-populations in the sample (stratification)
  • whether individual elements are drawn directly and independently or in groups (cluster samples)
  • how large a sample of elements is selected

$\bar{Y}$ = mean of the entire target population

$\bar{Y}_C$ = mean of the population on the sampling frame

$\bar{Y}_U$ = mean of the population not on the sampling frame

N = total number of members in the target population

C = total number of eligible members on the sampling frame

U = total number of eligible members not on the sampling frame
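Putting these definitions together, the coverage bias in the frame mean can be written compactly. A version of this expression appears in Groves et al., and it follows directly from the definitions above, since $N = C + U$:

$$\bar{Y}_C - \bar{Y} = \frac{U}{N} \big( \bar{Y}_C - \bar{Y}_U \big)$$

The bias grows with the share of the population missing from the frame ($U/N$) and with how different the uncovered people are from the covered people.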

If the values of our statistics computed on the respondent data differ from the values we would get if we computed statistics on the entire sample data, then we have non-response bias.

$\bar{y}_s$ = mean of the entire sample as selected

$\bar{y}_r$ = mean of the respondents within the $s$th sample

$\bar{y}_n$ = mean of the nonrespondents within the $s$th sample

$n_s$ = total number of sample members in the $s$th sample

$r_s$ = total number of respondents in the $s$th sample

$m_s$ = total number of nonrespondents in the $s$th sample

$s$th sample? Yup. Conceptually this is similar to the idea of trials. The sample we draw is one of many that we might possibly have drawn. It's one single realization.
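With this notation, the non-response bias in the respondent mean has the same structure as the coverage bias above. Again, this follows directly from the definitions, since $n_s = r_s + m_s$:

$$\bar{y}_r - \bar{y}_s = \frac{m_s}{n_s} \big( \bar{y}_r - \bar{y}_n \big)$$

The bias is larger when the non-response rate ($m_s / n_s$) is high and when respondents differ a lot from non-respondents.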

Adjustment Error

We make postsurvey adjustments to mitigate the damage of the types of errors we just discussed. Sometimes we introduce new errors.

You may have noticed

If a source of error is systematic, we call it bias. If it is not systematic, we call it variance. Most errors probably contain both biases and variances.

We can represent all of this with nice compact notation. In most cases, capital letters stand for properties of population elements and are used when we are talking about measurement and when sampling the population is not an issue. If we are drawing inferences about a population by using a sample, capital letters are for population elements and lower case are for sample quantities. Subscripts indicate membership in subsets of the population (e.g. $_i$ for the $i$th person).

Next Class

Populations & Samples

Read Babbie and Benaquisto (2014) pages 158-199

Make sure you budget enough time for this reading!

Populations and samples

There are two general types of samples: non-probability samples and probability samples.

Non-Probability Samples

  • relying on available subjects
    • e.g. stopping people at a street corner. this is extremely risky because you have no control over the representativeness of the sample. do not do this!
  • purposive sampling
    • you select people based on your own knowledge of the population
    • the goal is typically not to have a sample that is representative of a broader population
  • snowball sampling
    • members of a population are difficult to locate / identify
    • you find a few people, and then ask them to pass along information to people they know
  • quota sampling
    • you have a matrix describing key characteristics of the population
    • you sample people who share the characteristics of each cell in the matrix
    • e.g. you try to assign equal proportions of people who belong to different groups to your sample (e.g., if you know that 10% of all classics majors are female and international, then you select 10 female international students for a sample of 100 classics majors)
    • it can be hard to get up to date information about the characteristics of the population, and there tend to be high rates of sampling bias
  • selecting informants
    • an informant is a member of a group who is willing to share what they know about the group
    • informants are not the same as respondents, who are typically answering questions about themselves
    • often used in field research

Why do people use non-probability samples?

In quantitative studies, people use non-probability samples when (a.) it's the best that they can do under difficult circumstances, (b.) they don't know any better, or (c.) they actually don't care about the survey or the data. In qualitative studies, the entire logic of research is different. For one, research strategies tend to be more informed by case study methodology and systematic logical comparisons. Because the intellectual goals are different, different sampling strategies may be used. However, there are always important limitations to keep in mind. We will talk more about this later.

Probability Samples

We use probability samples because we want precise statistical descriptions of large populations. We need probability sampling because populations are very diverse. The goal is to create samples that include the same variations that exist in the population. Probability sampling helps us do this, and enables us to estimate the representativeness of our sample.

Samples are biased if they are not representative of the population they have been chosen from (recall: the representation pillar of survey design and quality).

Figure from Babbie and Benaquisto page 168

The goal is to define a population, produce a sampling frame, and then sample elements (i.e. people) from the sampling frame in a way that reflects variation in the population. Probability sampling enhances the likelihood of achieving this. Random selection is key to this process.

Random Selection

Every element has an equal chance of being sampled independent of any other event in the selection process. (Think about a coin toss. It doesn't matter how many times it comes up tails...)

Typically, random selection is done using computer programs that randomly select elements.
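In R, for example, drawing a simple random sample is one function call. The sampling frame below is just a stand-in, with the numbers 1 to 100 playing the role of 100 people on a list:

sampling_frame <- 1:100            # stand-in for a list of 100 people
sample(sampling_frame, size = 5)   # every element has an equal chance of selection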

Among other things, random selection ensures that the procedure is not biased by the researcher, who may otherwise (intentionally or not) choose cases that support her hypotheses. But random selection also provides access to probability theory, which we can use to estimate the accuracy of the sample.

Probability, Sampling Distributions, and Estimates of Sampling Error

Recall:

Parameter: The summary description of a given variable in a population (e.g. mean income, mean age)

Statistic: The summary description of a variable in a sample, used to estimate a population parameter.

Sampling Distributions

There are 17 students in the class. Let's say we want to take samples of 5 and use them to estimate the mean height. How many possible samples of 5 are there? 6188.
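You can check that count in R:

choose(17, 5)   # the number of distinct samples of size 5 from 17 students
## [1] 6188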

Probability Theory and the Central Limit Theorem

If you take random samples over and over and over and over again, they will converge on the true value.

We will talk more about this later.
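As a preview, here is a hedged sketch of what repeated sampling looks like in R. The population of heights is simulated, so the specific numbers are made up:

set.seed(1)
population <- rnorm(10000, mean = 170, sd = 10)   # a made-up population of heights (cm)
sample_means <- replicate(1000, mean(sample(population, size = 5)))
mean(sample_means)   # close to the true population mean of 170
hist(sample_means)   # the sampling distribution of the sample mean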

That was the theoretical model for sampling. Of course, in the real world, things are messy and complicated.

Types of Sampling Designs

  • Simple Random Sampling
  • Systematic Sampling
  • Stratified Sampling
  • Multi-Stage Cluster Sampling
  • Multi-Stage Designs and Sampling Error
  • Stratification in Multi-Stage Cluster Sampling
  • Probability Proportionate to Size
  • Disproportionate Sampling and Weighting

Some Realities of Sampling

You have been contracted by a Southern Ontario university to develop a new brand. You need to survey the population of students, grad students, staff, faculty, and other employees.

What do you do? Why?

Central tendency and variability

mean

  • sum all values and then divide by the number of values
  • misleading when there are extreme values (e.g. salary data)

median / 50th percentile

  • sort the values from high to low or low to high and take the middle value
  • if there is an even number of values, take the number that is midway between the two values in the middle
  • not sensitive to extremes

mode

The value that appears the most often.

Variation

range

  • the lowest value and the highest value

standard deviation

  • measures variation in a set of values relative to the mean
  • the bigger the standard deviation, the more variability relative to the mean
  • standard deviation has the same units of measure as the data
  • unless a distribution is highly skewed, approximately 2/3 of the values will fall within 1 standard deviation of the mean

R

numbers <- c(1, 2, 3, 2, 4, 4, 7, 9, 37, 3)   # a small vector with one extreme value (37)

mean(numbers)     # 7.2; pulled upward by the extreme value
median(numbers)   # 3.5; not sensitive to the extreme value
range(numbers)    # the lowest and highest values: 1 and 37
sd(numbers)       # standard deviation, in the same units as the data

Visualizing Distributions

histograms display the distribution of continuous data only. you bin the data by dividing the range into a set of intervals, then count how many values fall into each interval. The intervals must be adjacent, and are typically equal in size.

boxplots display a five number summary: minimum, first quartile, median, third quartile, and maximum

let's look at some boxplots! :)
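Using the small numbers vector from the R slide above (any numeric vector would do), both plots are one-liners:

hist(numbers)      # bins the values and counts how many fall in each interval
boxplot(numbers)   # five number summary; the extreme value (37) shows up as an outlier point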

TODO: Update Grades Examples

Hypothesis testing

Most social research, especially quantitative research, is about testing hypotheses. It is about making decisions using data. Typically, we want to test hypotheses about differences between groups. E.g.:

Do Germans differ from Canadians with respect to obedience to authority? Do Protestants or Catholics have higher rates of suicide? Do political conservatives discipline their children more severely than political liberals?

Each involves a comparison between two groups.

The Null Hypothesis

The null hypothesis is that any difference between the means in our sample is the product of *sampling error* alone. It is assumed to be true, and statistical evidence is required to reject it in favor of a research hypothesis. Typically, the hypothesis is labeled $H_0$.

When testing hypotheses about differences between groups, the notation is:

$\mu_{1} = \mu_{2}$

Where $\mu_{1}$ = the mean of the first population

Where $\mu_{2}$ = the mean of the second population

In other words, it is possible that we find a difference between means in the sample. The null hypothesis is that there is no difference between the true values, and that the observed difference is the result of sampling error. We "retain" the null hypothesis if we are unable to reject it. That does not mean we think it is true.

Our null hypotheses might look like these:

$H_0$ Germans are no more or less obedient to authority than Canadians.

$H_0$ Protestants have the same suicide rates as Catholics.

$H_0$ Political conservatives and liberals discipline their children to the same extent.

The research hypothesis is that there is a true difference between the population means. We accept the research hypothesis if we can reject the null hypothesis. The notation is:

$\mu_{1} \neq \mu_{2}$

Where $\mu_{1}$ = the mean of the first population

Where $\mu_{2}$ = the mean of the second population

$H_a$ Germans differ from Canadians with respect to obedience to authority.

$H_a$ Protestants do not have the same suicide rates as Catholics.

$H_a$ Political conservatives differ from liberals with respect to how severely they discipline their children.

We number the research hypotheses. E.g. $H_1$, $H_2$, $H_3$, etc.

This is analogous to a court of law. The null hypothesis ($H_0$) is that the defendant is innocent, and the research hypothesis ($H_a$) is that the defendant is guilty. We require evidence to reject the null hypothesis in the same way that we require evidence to convict.

If we do not require very much evidence to convict, then we will probably increase the number of guilty people convicted, but we will also increase the number of innocent people convicted (this false positive is a Type I error, discussed later). On the other hand, if we require a lot of evidence to convict, we will convict fewer innocent people, but we will also convict fewer guilty people (this false negative is a Type II error, discussed later).

Back to Sampling Distributions

If the null hypothesis is true, then the true value of the difference in means between two groups is 0. If we were to sample each group over and over again, compute the means, and then take the difference between them, we would get a sampling distribution with the differences converging on 0.

We know that 68.26% of differences will fall within 1 standard deviation of 0, 95.44% within 2 standard deviations, and 99.74% within 3 standard deviations.

If the difference between means in our sample falls so far from 0 that it has a very low probability of occurrence in the sampling distribution of differences between means, we reject the null hypothesis. If the value falls close enough to 0 that there is a high probability of occurrence, we must consider that the difference is the result of sampling error and retain the null hypothesis.

Statistical Significance

$\alpha$ is the probability threshold below which we decide we can reject the null hypothesis.

conventions:

$\alpha$ = .05
$\alpha$ = .01

$P$ is the exact probability of obtaining the sample data if the null hypothesis is true. If it falls below the $\alpha$ value we select, we reject the null hypothesis.

$\alpha$ = .05 means there is a 5% chance of a Type I error. $\alpha$ = .01 means there is a 1% chance of a Type I error.

Type I vs. Type II Errors

Type I: We rejected the null hypothesis when we should not have

Type II: We did not reject the null hypothesis when we should have

Therefore, there are four possible outcomes from this decision making process:

Truth | Decide | Result
$H_0$ | $H_0$  | Correctly accept null
$H_0$ | $H_a$  | Type I error
$H_a$ | $H_a$  | Correctly reject null
$H_a$ | $H_0$  | Type II error

Remember, statistical significance $\neq$ importance.

Scatterplots, correlation, linear regression

Most really interesting data analysis is about relationships between and across variables.

If we think that education affects income, we would call education the explanatory variable (also called a predictor or the independent variable) and income the response variable (also called the dependent variable). There are usually many explanatory variables that potentially affect the response variable.

Alternatively, we may want to use one variable to predict another without assuming that it actually causes it.

Scatterplots are graphs for examining the relationship between two quantitative variables. The correlation coefficient measures the degree to which two quantitative variables are related if their relationship is well summarized by a straight line.

In a scatterplot, the values of one variable (typically the explanatory variable) are on the x-axis and the values of the other variable (the response variable) are on the y-axis. Each individual observation in the data is plotted as a point in the graph according to its values ($x_i$, $y_i$) for the two variables.

Let's start with scatterplots by looking at some obvious relationships.

Fascinating!

Now for something more interesting... health expenditures and infant mortality per 1,000 live births, restricted to countries that spend more than 2,000 USD per capita.

Interpreting Scatterplots

Do you see clusters? Are there outliers? Do you see some sort of relationship or association? If so, is there some sort of direction?

If the relationship takes the form of a straight line, we have a ... linear relationship! We can look at the correlation between two variables to get a better sense of the direction and the strength of their association.

Correlation

The correlation $r$ measures the strength and direction of the linear relationship between two variables $x$ and $y$. It is defined as:

$$ r = \frac{1}{n - 1} \sum \bigg( \frac{x_i - \bar{x}}{s_x} \bigg) \bigg( \frac{y_i - \bar{y}}{s_y} \bigg) $$

Calculating correlations by hand is tedious, to say the least. Do it in R instead, like a reasonable person.

cor()
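For example, with two made-up numeric vectors (cor() also accepts a data frame of quantitative variables and returns a correlation matrix):

x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 4, 9, 11)
cor(x, y)       # Pearson correlation between x and y
# cor(mydata)   # mydata is a hypothetical data frame; this returns a correlation matrix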

If you want to walk through that formula a bit...

$s_x$ and $s_y$ are the standard deviations of $x$ and $y$.

$\big( \frac{x_i - \bar{x}}{s_x} \big)$ and $\big( \frac{y_i - \bar{y}}{s_y} \big)$ are therefore the standardized scores (z-scores) for $x$ and $y$ for observation $_i$.

The product $\big( \frac{x_i - \bar{x}}{s_x} \big) \big( \frac{y_i - \bar{y}}{s_y} \big)$ will be positive when both $x$ and $y$ are above their means or when both are below their means.

The product is negative when one variable is above its mean and the other below.

The sigma sign ($\sum$) means that these products are added up over all observations.

When there is a positive relationship between $x$ and $y$, the correlation is therefore positive; when there is a negative relationship, the correlation is negative.

The $x$ and $y$ variables in a correlation do not enter as explanatory and response variables.

Correlation has no units because it uses standardized variables. It does not change if we change the units of measurement of the variables.

Correlations indicate the strength of linear relationships and are always between −1 and +1.

  • A correlation of −1 signifies a perfect negative linear relationship, that is, all of the points are exactly on a line with negative slope (from upper left to lower right).
  • A correlation of +1 signifies a perfect positive linear relationship, all of the points are exactly on a line with positive slope (from lower left to upper right).
  • A correlation of 0 indicates no linear relationship between x and y.
  • Correlations are strongly affected by outliers
  • Correlations are not good measures of non-linear relationships, and can often produce correlations of 0 for strong non-linear relationships.

Correlation Matrices

Correlations are only appropriate for quantitative variables. They are nonsense when applied to non-quantitative variables.

And of course, correlation is NOT causation.

Regression Analysis

Missing a class has really put us behind. So today will be a broad overview of some of the key things about simple and multiple regression. It will be enough to give you a foundation for future learning about regression models.

It also means we have less time to work in R. I will make up for this by including some R programming for linear models when we work in R in the section on network analysis.

You will get your data challenges today. Complete them before Friday.

With correlation, we looked at how to measure the association between two quantitative variables. What if we want to predict one variable from another? Any particular outcome can be predicted by a combination of a model and some error.

$$ outcome_i = (model) + error_i $$

If there is a linear relationship between our response and explanatory variables, we can summarize the relationship between them with a straight line.

Why do criminal sentences vary? Could it be related to the number of prior convictions a person has?

We use the equation of a straight line. Let's start with just one explanatory variable, which makes this a simple regression:

$$ y = a + bx $$

where:

  • $a$ is the y-intercept of the line (i.e. the y-value corresponding to an x-value of 0)
  • $b$ is the slope of the line, indicating how much y changes when x is increased by 1

If $b$ is positive, the value of y increases as x increases. If $b$ is negative, it decreases as x increases. If $b$ = 0, the value of y does not change with x.

We will fit a regression model to our data and use it to predict values for the response variable.

$$ Y_i = (b_0 + b_1 X_i) + \epsilon_i $$

where:

  • $Y$ is the outcome we want to predict
  • $\epsilon$ is an error term / residual
  • $X_i$ is person $i$'s value on the predictor (e.g. number of prior convictions)

$b_0$ and $b_1$ are regression coefficients.

  • $b_0$ is the y-intercept
  • $b_1$ is the slope of the line

Continuing with our example, if we want to predict the length of the sentence based on the number of prior convictions:

$$ Y_i = (b_0 + b_1 X_i) + \epsilon_i $$

or

$$ Y_i = (b_0 + b_1 Priors_i) + \epsilon_i $$

where :

the length of an individual's sentence is a function of (1) a baseline amount given to all defendants + (2) an additional amount for each prior conviction, and (3) a residual value that is unique to each individual case.

There are many lines we could fit to describe the data. To find the line of best fit, we typically use a method called least squares. The method of least squares will go through, or get close to, as many of the points as possible.

We will have both positive and negative residuals, because there will be data points that fall both above and below our line of best fit. We square the differences before adding them up to prevent the positive residuals (points above the line) from canceling out the negative residuals (below the line).

A less common way of doing this is to take their absolute values, but it is more mathematically complicated. It's rarely done.

If the squared differences are very big, the line does a poor job of representing the data. If the squared differences are small, it does a good job of representing the data.

The line of best fit is the one with the lowest Sum of Squared Differences ($SS$ for short, or $\sum residual^2$). The method of least squares selects the line with the smallest $SS$. Or, according to Andy Field, a bearded wizard called Nephwick the Line Finder finds the line of best fit for us. As long as we don't have to do it by hand...

Goodness of Fit

We might have the best line possible, but is it a good fit?

We have the best line possible now. But what if it does a really bad job of actually fitting the data? To assess the goodness of fit:

  • $R^2$ tells us how much of the variance in the outcome variable our model explains, compared to the amount of variance that was there to explain in the first place.
  • $F$ tells us how much the model can explain relative to how much it can't explain.

These values will be reported in your R output.
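A hedged sketch of what this looks like in R, using made-up sentencing data (the variable names and values below are purely illustrative):

# made-up data: sentence length in months and number of prior convictions
priors   <- c(0, 1, 1, 2, 3, 4, 5, 6, 8, 10)
sentence <- c(4, 6, 5, 9, 10, 14, 15, 19, 22, 28)

fit <- lm(sentence ~ priors)   # least squares fit of the line of best fit
summary(fit)                   # reports the coefficients, R-squared, and the F statistic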

Cautions

  • linear least squares is a good summary if the relationship is indeed linear, and if the data are "well-behaved"
  • there can be problems if there are influential outliers. (Not all outliers are influential.) Always plot your data, including plots like residuals against $x$ to help identify influential outliers.

Assumptions

  • the average values of $y$ are linearly related to $x$: $\mu_y = \alpha + \beta x$
  • the standard deviation of $y$ is the same regardless of the value of $x$
  • the distribution of $y$ at each value of $x$ is a normal distribution
  • the values of $y$ are independent of one another

These assumptions can easily be wrong. We have to check and see whether the assumptions are reasonable.

Extrapolation

It is dangerous to summarize a relationship between $x$ and $y$ beyond the range of the data.

Lurking Variables

A lurking variable is one that has an important effect on the relationship between $x$ and $y$ but is omitted from the analysis. This can lead to you missing a relationship that is present, or inducing a relationship that is not present.

The possible presence of lurking variables makes causal analysis more difficult in observational work. Multiple regression does better than simple regression. Experimental designs are best at mitigating lurking variables.

Multiple Regression

Why do criminal sentences vary? deserved? moody judge? long criminal record? vicious crime? defendant's race? race of victim? There could be many theories. What is the relative importance of each variable?

By extending regression analysis to 2 or more explanatory variables, we (1) reduce the size of the residuals and therefore account for more variation in the response variable, and (2) can hold these additional causes of the response variable constant statistically, resulting in a more accurate estimation of the effect of $x$ on $y$ because it is less likely that we will omit lurking variables.

We can extend the number of explanatory variables in multiple regression to $k$ variables, $x_1, x_2, ..., x_k$ for the regression equation:

$$ y = a + b_1 x_1 + b_2 x_2 + ... + b_k x_k + residual $$

Now we have a new coefficient for each explanatory variable we add! :) The outcome is predicted from the combination of all the variables multiplied by their respective coefficients and, of course, the residual term.

$b_1$ is the average change in $y$ for a one-unit increase in $x_1$ holding the other explanatory variables constant.

  • The slope $b_1$ can also be interpreted as the average change in $y$ associated with a one-unit increase in $x_1$, holding the value of $x_2$ constant.
  • The slope $b_2$ is similarly interpreted as the average change in $y$ associated with a one-unit increase in $x_2$, holding $x_1$ constant.
  • etc.
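Extending the sketch above, adding a second (again made-up) explanatory variable only changes the model formula:

severity <- c(2, 1, 3, 2, 4, 3, 5, 4, 5, 6)   # a hypothetical crime severity score
fit2 <- lm(sentence ~ priors + severity)      # multiple regression with two predictors
summary(fit2)                                 # each slope is interpreted holding the other predictor constant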

How do you add predictors to a model?

  • hierarchical: add known variables (from theory and previous research) first, add new variables last
  • forced-entry: force all explanatory variables into the model at the same time
  • stepwise: enter the variables into the model based on mathematical criteria, typically correlation with the outcome

A Final Caution

Association is not causation. Obviously.

Experiments are better for causal claims. We will talk about them tomorrow, when we get to Devah Pager's work!

Quantitative Challenge

plan a 10 question survey. identify a population and get a sampling frame. stop there.

plan a 30 question survey with skip logic. identify a population. no need to get a sampling frame, but include a discussion of measurement error. stop there.

write a press release for an article of your choice that explains, in detailed but plain language, the results of a multiple regression analysis. do not simply report the results as the author does.

Thinking Experimentally

Devah Pager. 2003. "The Mark of a Criminal Record." American Journal of Sociology. 108(5):937-75.

How is this study motivated? Why an experimental design?

What are the three main questions?

Do employers make decisions about hiring using information about criminal backgrounds?

Does race continue to serve as a barrier to employment?

Does the effect of a criminal record differ for black and white applicants?

How does the audit methodology work? What decisions did Pager make when designing the study this way?

What are the core findings?

What are the theoretical and policy implications?

Pager 2009. "Field Experiments for Studies of Discrimination" in Research Confidential.

What would Pager do differently if she was going to do it again?

Design an experimental audit study!

I have 100K in grant money to award. I will only award it to one team. You have 500 words to pitch your experimental audit study. What will you do? Why is it important? Why should I fund you?

Thinking Relationally

Social Networks Research

Basic Concepts

& Data Structures

Why social networks?

We are connected and interdependent. We are all part of multiple networks that we help shape, but that also shape the opportunities and constraints we encounter in our lives.

Types of Networks

  • ego networks vs. whole networks
  • one-mode vs. two-mode / affiliation / bipartite networks
  • valued vs. unvalued (or binary or weighted)
  • directed vs. undirected
  • longitudinal vs. cross-sectional
  • multiplex and multiple networks

Empirical Example

How did the Medici family come to dominate Florence during the early 15th century? How did they control a city that many considered to be basically uncontrollable? In a classic paper, Padgett and Ansell show how the Medici family strategically developed a marriage network with other powerful families, and used that network to join a more integrated economic and political elite in Renaissance Florence. Let's use their data! :)

Network Data

  • adjacency (case by case) matrices
  • incidence (case by event) matrices
  • edge lists and attribute files
  • .graphml and other data structures

Adjacency Matrices

In an adjacency matrix, the rows and columns both have the ids for the nodes. If it is a binary network, a tie is indicated with a 1. If it is a weighted network, the cell value (e.g. 2, 3, 5, 19) represents the weight of the tie.

Edge list

The first two columns indicate a relationship between two people. It may or may not be directed. It is common to also have a weight column in an edge list for a weighted network. Obviously, this edge list does not have one.
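A small sketch of both data structures in R with igraph (the three-person network here is invented just to show the formats):

library(igraph)

# adjacency matrix for a toy undirected network: A knows B and C
adj <- matrix(c(0, 1, 1,
                1, 0, 0,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
g1 <- graph_from_adjacency_matrix(adj, mode = "undirected")

# the same network as a two-column edge list
el <- matrix(c("A", "B",
               "A", "C"), ncol = 2, byrow = TRUE)
g2 <- graph_from_edgelist(el, directed = FALSE)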

Levels in networks

nodes, dyads, triads, cohesive subgroups / communities, positions, network level

Important Network Ideas

  • reciprocity
  • preferential attachment
  • closure / triangulation / clustering
  • small worlds
  • strong and weak ties
  • network brokerage
  • closure with positive and negative ties
  • actor attributes: social selection
  • actor attributes: social influence or diffusion
  • network self-organization / endogenous network effects
  • dynamic network processes: co-evolution of structure and attributes
  • social capital
  • embeddedness
  • multiplexity
  • autocatalysis

Let's get our hands dirty!

We will begin with some very basic network visualizations using igraph for R. Along the way, I will also show you how to compute some basic descriptive statistics for networks (such as node centralities) and how to detect cohesive subgroups. We will go into these in more depth in other classes.

The Hartford Connecticut Drug User Network

This is an anonymized network dataset on people that share needles with one another. It is from a larger study on drug use, HIV risk behaviors, and housing conducted by Julia Dickson-Gomez and her research team at the Medical College of Wisconsin. We have a very small piece of their data to work with.

Visualizing the Network

The default plot doesn't look good, so we will change the size of the nodes / vertices, the width of the edges, and the color of the nodes. We will also remove the node labels and apply a network layout (Kamada and Kawai).
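A hedged sketch of the kind of plotting call this involves. The specific sizes and colors are just choices, and drugNet is the name we give the network object in the code below:

plot(drugNet,
     vertex.size = 4,                    # smaller nodes
     vertex.label = NA,                  # drop the node labels
     vertex.color = "tomato",            # any color will do
     edge.width = 0.5,                   # thinner edges
     layout = layout_with_kk(drugNet))   # Kamada-Kawai layout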

Your turn!

Ego Networks

What are they again?

The plan:

  • wait, what is social capital?
  • a review of position generators
  • a review of resource generators
  • a review of name generators
  • nominalist vs. realist approaches
  • a brief review of other ways of collecting ego network data
  • reliability, validity, accuracy
  • write our own generators!

Social Capital

According to Nan Lin and Pierre Bourdieu, you have social capital when you have indirect access to resources via network connections. You do not directly own or control them yourself.

Research on social capital often employs position, resource, and name generators to collect data on indirect access to resources.

Position Generators

Respondents are asked to indicate whether or not they have a tie to anyone holding any of a list of n occupations. Typically, the list of occupations is adapted based on the research question, but sometimes it is not (e.g. in large government sponsored surveys).

E.g. page 46 of Crossley et al. 2015

Resource Generators

What if you have a tie to someone in a particular occupation, but they do not provide access to the resource the researcher is interested in?

In resource generator questions, respondents are asked to indicate whether or not they would have someone to turn to if they needed to access one or a range of resources.

E.g. page 49 of Crossley et al. 2015

Name Generators

Position and resource generators do not actually collect information about concrete ties. Name generators do. Name generators typically include 3 elements:

  • alters: information on ego's relevant alters
  • structure: information about the relationships between the alters
  • alter attributes: when possible, basic information about the alters (e.g. gender, ethnic identity, occupation, etc.)

E.g. page 53 of Crossley et al. 2015

Nominalist vs. Realist

In this case:

Nominalist means that the researcher decides the boundaries of the ego's network.

Realist means that ego decides the boundaries of her own network.

Other Collection Methods

  • visual aids
  • qualitative interviews
  • ethnographic observation
  • diaries
  • archives
  • the internet

reliability: the data gathered is not affected by contextual factors, e.g. the interviewer

validity: the data capture what the researchers aim to capture

accuracy: people can confidently recall and report their ties without intentional or unintentional omission

Writing Generators

Break into 3 groups.

Write generators on Google Forms:
  • group 1: position
  • group 2: resource
  • group 3: name

Answer generators on Google Forms:
  • the whole class answers the position and resource generators as ANON
  • nobody answers the name generators

Centrality Analysis and Structural Cohesion

Centrality and structural cohesion

  • What does it mean to be "central" in a network?
  • Why should we look for central actors?
  • Can you be central in different ways?
  • What does it mean for a network to be cohesive?
  • What does it mean for a subgroup within a network to be cohesive?

Centrality

Researchers began formally developing measures of centrality in social networks in the 1950s. In a classic 1979 article, Lin Freeman proposed that centrality measures have 3 properties:

  • they are computed for individuals
  • they should be normalized by the size of the network to facilitate comparisons across networks
  • they can be used to compute a centralization score for an entire network

Degree Centrality

The number of links to and from a person. In a directed network, in-degree is the number of ties received and out-degree is the number of links sent. Degree centrality is entirely local. It does not consider any information about the larger network.

Computing degree in igraph

We called our network "drugNet."

degree(drugNet)
##   [1]  3  4  3  4  1  2  3  4  3  4  2  1  3  2  3  2  5  4  6  2  6  4  3
##  [24]  1  1  4 11  5  3  3  4  2  5 10  1  2  4  1  2  1  1  5 15  3  3  7
##  [47]  2  2  7  1  1 11  7  4  4  5  2  1  4  3  5  2  3  3  3  5  3  3  5
##  [70]  2  3  3  1  3  3  2  5  3  1  1  2  1  4  4  1  4  4  2  4  5  1  4
##  [93]  2  3  2  5  1  2  5  2  6  3  5  2  3  1  2  2  1  2  1  2  5  2  6
## [116]  5  3  1  1  2  1  3  1  1  5  3  2  4  4  6  2  1  1  1  3  3  4  1
## [139]  3  3  2  4  4  2  1  2  3  2  1  2  1  1  5  2  1  5  1  3  4  2  5
## [162]  1  1  3  1  1  2  1  2  1  2  2  2  1  2  2  2  1  1  1  1  1  2  1
## [185]  1  1  1  1  2  2  2  2  1

Assign these scores to their vertices.

V(drugNet)$deg <- degree(drugNet)

Degree Centrality, Formally

You count the number of alters adjacent to ego. The formula for degree centrality for actor $i$ is:

$$C_D (i) = \sum_{j=1}^n x_{ij} = \sum_{j=1}^n x_{ji}$$

where:

  • $x_{ij}$ = the value (i.e. 0 or 1) of the tie from actor $i$ to actor $j$; the sum therefore counts all of ego's ties
  • $n$ = the number of nodes in the network

Eigenvector Centrality

An extension of degree centrality. You take the degree centrality of each ego's alters. If they have high degree centrality, then the ego will have high eigenvector centrality.

Example: You just started a new job. You want to get to know as many colleagues as possible. Who do you want to be connected with?

Computing eigenvector centrality in igraph

V(drugNet)$eigen <- eigen_centrality(drugNet, scale = TRUE)$vector
V(drugNet)$eigen
##   [1] 0.0033251156687 0.0034292523692 0.0030622104809 0.0001908052374
##   [5] 0.0413600314799 0.0000015719816 0.0008948607627 0.1943917816685
##   [9] 0.0014541613108 0.0126074728879 0.0427034738678 0.0002164736346
##  [13] 0.0038904777961 0.0002290518189 0.0571671384492 0.1593675611369
##  [17] 0.2388326076831 0.2228143815849 0.3424894088939 0.0652906721406
##  [21] 0.2862770764670 0.2494370104772 0.0011653172078 0.0196718717588
##  [25] 0.0102094452411 0.0537040712035 0.6750193370678 0.0008401657555
##  [29] 0.0001733457873 0.0005313444924 0.0007717021447 0.0863073175945
##  [33] 0.0009892576402 0.0039396516632 0.0651091581589 0.0007924939203
##  [37] 0.0010340365913 0.0002079796045 0.0000346671430 0.1901056104746
##  [41] 0.0474193751517 0.3067584331293 1.0000000000000 0.0181608044799
##  [45] 0.0011389741852 0.3859772896957 0.1662937033550 0.0846788848577
##  [49] 0.5133870704463 0.0000065904184 0.0547587570133 0.6359111883566
##  [53] 0.1573868744308 0.1581771131623 0.3319537333539 0.2693898746508
##  [57] 0.3109958951448 0.0000334142675 0.0001757668668 0.0128741552216
##  [61] 0.0097214472890 0.1313382091847 0.0942213395218 0.2023994787461
##  [65] 0.0060866438633 0.1234900167151 0.0484074365896 0.0418489372610
##  [69] 0.0000387196456 0.0000023192801 0.0000595360630 0.0002156801607
##  [73] 0.0000073608219 0.2010367074135 0.0000857344157 0.1333239757532
##  [77] 0.0301991839616 0.0259894287940 0.0000073608219 0.0000315022049
##  [81] 0.0000076368175 0.0000004409082 0.1946536331390 0.1774826459012
##  [85] 0.0733764482869 0.1596664093634 0.0010940213913 0.1316929474852
##  [89] 0.0025202386646 0.0738999557702 0.0002079796045 0.1034786491031
##  [93] 0.1331365366462 0.0000830067882 0.0018352901414 0.0013991976711
##  [97] 0.0000329540067 0.0005685914020 0.2880438766464 0.0886137953240
## [101] 0.0071338169774 0.0000830067882 0.2175634447434 0.1364431331245
## [105] 0.0000016332195 0.0000003104842 0.0000932092634 0.0000315601123
## [109] 0.0253100025756 0.0565225133311 0.0006519201151 0.0000622295605
## [113] 0.0004280729783 0.0015158988554 0.0007412907165 0.0017250386530
## [117] 0.0000079701486 0.0000362731461 0.0000003104842 0.0082540192011
## [121] 0.0015691353591 0.0009525644501 0.0107452469024 0.0001810878463
## [125] 0.2865764007335 0.0002341327877 0.0004049164069 0.0004123722502
## [129] 0.0234998240108 0.0007707590265 0.0003891665039 0.0044674483896
## [133] 0.0000014518019 0.0023967513299 0.1795529002524 0.0011581680644
## [137] 0.0110199400452 0.1901056104746 0.2364332436187 0.3678654259104
## [141] 0.0000117590478 0.0013818243662 0.0003220682309 0.0004049164069
## [145] 0.0018481016716 0.2600388918398 0.0365425444265 0.0310419897988
## [149] 0.0140487962057 0.0046485362359 0.0049990125977 0.0004131951421
## [153] 0.0028787784725 0.3184305736301 0.0000362731461 0.0031489638635
## [157] 0.0001880634276 0.0011121284264 0.0011387019776 0.0042955508093
## [161] 0.0881871329744 0.0059012564211 0.0069469427165 0.0970882597535
## [165] 0.0034524708224 0.0005821433929 0.0636254036710 0.0299201278444
## [169] 0.0632335860558 0.0001965761574 0.0262959761431 0.0001657089697
## [173] 0.0632335860558 0.0000213194631 0.0146309395986 0.0675190244827
## [177] 0.0001121453651 0.0057410343029 0.0001810878463 0.0299201278444
## [181] 0.0140487962057 0.0299201278444 0.1266313189730 0.0140487962057
## [185] 0.0000362731461 0.0020949524297 0.0000002988425 0.0057410343029
## [189] 0.0021735031444 0.0003289869518 0.0019200961091 0.0003787075895
## [193] 0.0000719944375

Betweenness Centrality

Like eigenvector centrality, it takes the structure of the network into account when computing individual scores. The key idea here is that it is not the number of people you know in the network, it's where you are located in the network. Being positioned between two otherwise disconnected actors affords advantages. This is especially true if it lets ego control the flow of information or resources.

Computing Betweenness Centrality in igraph

V(drugNet)$bet <- betweenness(drugNet, normalized = TRUE)
V(drugNet)$bet
##   [1] 0.00725760236 0.01767426902 0.05687441134 0.03108638743 0.00000000000
##   [6] 0.01041666667 0.04362092787 0.16354003369 0.00741623729 0.05258239494
##  [11] 0.01250998674 0.00000000000 0.11848775361 0.00079897469 0.06203647435
##  [16] 0.00000000000 0.05524004655 0.06292543595 0.01991437609 0.00912049156
##  [21] 0.03931570992 0.01142833770 0.05381148023 0.00000000000 0.00000000000
##  [26] 0.03492673793 0.12044912318 0.14860994245 0.11840095986 0.00108581609
##  [31] 0.00341212471 0.00193608202 0.06951713753 0.17685792732 0.00000000000
##  [36] 0.00864783304 0.01600813593 0.00000000000 0.01041666667 0.00000000000
##  [41] 0.00000000000 0.04599103963 0.27507358827 0.12109127102 0.02713423502
##  [46] 0.09031215294 0.00178337696 0.00364856021 0.02520282764 0.00000000000
##  [51] 0.00000000000 0.16012357471 0.05372354930 0.03230538761 0.11544664932
##  [56] 0.08284003633 0.00000000000 0.00000000000 0.04078289911 0.01555280271
##  [61] 0.05244691681 0.00000000000 0.01498963787 0.01032075972 0.04664247971
##  [66] 0.36684736648 0.01366979246 0.02566405925 0.10083987784 0.01041666667
##  [71] 0.03092277487 0.04128698163 0.00000000000 0.00263903173 0.00775341768
##  [76] 0.02072425829 0.08558664194 0.35566847302 0.00000000000 0.00000000000
##  [81] 0.01041666667 0.00000000000 0.00465036254 0.00619234605 0.00000000000
##  [86] 0.03458520752 0.02430555556 0.00863479283 0.01415006302 0.10344945497
##  [91] 0.00000000000 0.09009456358 0.01041666667 0.00518106457 0.00000000000
##  [96] 0.03553115218 0.00000000000 0.02072425829 0.04187903837 0.00413576207
## [101] 0.32674367593 0.00518106457 0.04821701259 0.01350832638 0.02077879581
## [106] 0.00000000000 0.00081773820 0.00000000000 0.00000000000 0.01041666667
## [111] 0.00000000000 0.00002726876 0.01303433786 0.08413711252 0.05221558486
## [116] 0.08891630521 0.05131980803 0.00000000000 0.00000000000 0.01041666667
## [121] 0.00000000000 0.02077879581 0.00000000000 0.00000000000 0.19448216033
## [126] 0.00954439095 0.00000000000 0.00245873328 0.07513089005 0.07081468001
## [131] 0.01410249418 0.00000000000 0.00000000000 0.00000000000 0.00160340314
## [136] 0.01753390770 0.03599476440 0.00000000000 0.01964142878 0.00070807882
## [141] 0.02072425829 0.01332009716 0.03141997528 0.00000000000 0.00000000000
## [146] 0.00000000000 0.01041666667 0.01041666667 0.00000000000 0.03092277487
## [151] 0.00000000000 0.00000000000 0.04567422228 0.00000000000 0.00000000000
## [156] 0.02491334594 0.00000000000 0.01487115058 0.06692815362 0.00000000000
## [161] 0.12599220318 0.00000000000 0.00000000000 0.04176303006 0.00000000000
## [166] 0.00000000000 0.00463555950 0.00000000000 0.00000000000 0.00000000000
## [171] 0.01041666667 0.01041666667 0.00000000000 0.00000000000 0.06521865218
## [176] 0.00364319893 0.01041666667 0.00000000000 0.00000000000 0.00000000000
## [181] 0.00000000000 0.00000000000 0.08412321844 0.00000000000 0.00000000000
## [186] 0.00000000000 0.00000000000 0.00000000000 0.01041666667 0.00276648072
## [191] 0.02072425829 0.01041666667 0.00000000000

Betweenness Centrality, Formally

$$C_B (i) = \sum_{j < k} g _{jk} (i) / g _{jk}$$

where:

  • $g_{jk}$ = the number of geodesics connecting $jk$
  • $g_{jk} (i)$ = the number of geodesics that actor $i$ is on
  • a geodesic is the shortest path between two actors

betweenness centrality, normalized:

$$ C^{\prime}_B (i) = C_B (i) / [(n - 1)(n - 2) / 2]$$

$(n - 1)(n - 2) / 2$ = the number of pairs of vertices excluding vertex $i$ itself

Closeness Centrality

If an ego is "close" to many different people, she is more independent. That is, she does not have to rely on specific people for information, resources, etc. She can reach many people without going through intermediaries. Closeness is computed by measuring the distance between actors in a network, where having short distances to many other nodes results in higher closeness centrality.

Computing closeness centrality in igraph

V(drugNet)$close <- closeness(drugNet, normalized = TRUE)
V(drugNet)$close
##   [1] 0.12467532 0.12475634 0.12565445 0.10223642 0.14436090 0.09943035
##   [7] 0.11340815 0.20512821 0.11267606 0.13983977 0.17036380 0.13892909
##  [13] 0.17219731 0.13521127 0.15907208 0.17407072 0.19296482 0.18011257
##  [19] 0.17843866 0.15867769 0.18250951 0.17777778 0.16000000 0.15559157
##  [25] 0.13016949 0.14953271 0.19315895 0.15699101 0.13812950 0.13724089
##  [31] 0.14065934 0.16026711 0.15384615 0.17630854 0.15153907 0.15106216
##  [37] 0.15070644 0.13530655 0.11721612 0.17663293 0.15106216 0.18426104
##  [43] 0.21428571 0.17943925 0.14769231 0.19433198 0.17598533 0.15763547
##  [49] 0.18622696 0.10497540 0.15508885 0.20622986 0.16681147 0.17614679
##  [55] 0.18622696 0.19433198 0.18285714 0.11707317 0.13250518 0.13417191
##  [61] 0.13943355 0.16652212 0.18461538 0.18096136 0.18199052 0.21452514
##  [67] 0.18479307 0.15471394 0.12299808 0.09805924 0.12136536 0.13521127
##  [73] 0.10958904 0.17647059 0.12443292 0.16257409 0.15635179 0.20425532
##  [79] 0.10958904 0.11977542 0.10971429 0.08934388 0.18285714 0.17860465
##  [85] 0.16284987 0.18934911 0.15635179 0.17173524 0.16509028 0.16066946
##  [91] 0.13530655 0.18408437 0.16229924 0.11657559 0.16340426 0.15946844
##  [97] 0.12144213 0.14826255 0.18338109 0.17598533 0.19374369 0.11657559
## [103] 0.16856892 0.17566331 0.09953344 0.09056604 0.12213740 0.10451824
## [109] 0.13973799 0.17391304 0.11098266 0.11808118 0.13882863 0.16976127
## [115] 0.15360000 0.16946161 0.11021814 0.09279845 0.09056604 0.13426573
## [121] 0.11844540 0.12252712 0.14826255 0.10921502 0.20983607 0.13361169
## [127] 0.14512472 0.13872832 0.16066946 0.15011728 0.14623001 0.13852814
## [133] 0.09891808 0.12276215 0.17142857 0.15699101 0.14667685 0.17663293
## [139] 0.18011257 0.18250951 0.10853590 0.15776500 0.13159698 0.14512472
## [145] 0.12244898 0.17679558 0.16466552 0.14328358 0.13852814 0.13913043
## [151] 0.12299808 0.11367673 0.17328520 0.18233618 0.09279845 0.17006200
## [157] 0.13342599 0.15647922 0.16120907 0.12598425 0.18695229 0.12540823
## [163] 0.14148858 0.18181818 0.15226011 0.11169284 0.17679558 0.14307004
## [169] 0.16298812 0.13105802 0.14014599 0.13597734 0.16298812 0.11462687
## [175] 0.14096916 0.17810761 0.12938005 0.13530655 0.10921502 0.14307004
## [181] 0.13852814 0.14307004 0.17679558 0.13852814 0.09279845 0.12800000
## [187] 0.09048068 0.13530655 0.12817089 0.13773314 0.12276215 0.10952653
## [193] 0.09876543

Closeness, Formally

$$C_C (i) = \bigg[ \sum_{j=1}^N d(ij) \bigg]^{-1}$$

where:

  • $d(ij)$ is the distance connecting actor $i$ to actor $j$

normalized:

$$C^{\prime}_C (i) = (N - 1) \, C_C (i)$$

Structural Cohesion

"cohesion" is often involved in social science research, e.g.:

  • community cohesion prevents crime
  • cohesive civil society promotes democracy
  • cohesive subgroups are responsible for the persistence of sexually transmitted infections

But what is cohesion?

Moody and White (2003) develop a social network definition of structural cohesion (of a group) and embeddedness (of a node). The basic idea is that structurally cohesive groups hold together when you start to strategically remove nodes.

Structural cohesion analysis involves recursively disconnecting a network by determining the minimum number of nodes that have to be removed to break apart each component, and then deleting those nodes.

The result is a collection of k-components, where k is the minimum number of nodes required to disconnect each k-component. In other words, a minimum of 2 nodes must be removed to disconnect a 2-component, 3 nodes to disconnect a 3-component, and so on. As the algorithm progressively disconnects the network, it uncovers a nested hierarchy of cohesive groups, each one more deeply embedded in the network.
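igraph ships an implementation of this kind of analysis; here is a minimal sketch on our needle-sharing network (it can be slow on larger networks):

blocks <- cohesive_blocks(drugNet)   # recursively find the nested cohesive groups
blocks                               # prints the blocks in the hierarchy
cohesion(blocks)                     # the k (vertex connectivity) of each block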

Modeling Networks

  • What is the central finding from "Chains of Affection?"
  • Why did Bearman, Moody, and Stovel model these networks rather than simply provide descriptive statistics about the observed network?
  • Why did they run so many simulations?

From the abstract:"... The study offers a comparison of the structural characteristics of the observed network to simulated networks conditioned on the distribution of ties; the observed structure reveals networks characterized by longer contact chains and fewer cycles than expected. This article identifies the micromechanisms that generate networks with structural features similar to the observed network."

Exponential Random Graph Models (ERGMs)

Like other statistical models, ERG models represent theories we have about our observed data. ERGMs are models of network structure, not individual outcomes. They permit inferences about how networks are created and sustained by multiple tie formation processes (e.g. reciprocity, exchange, triadic closure, preferential attachment, homophily). We can examine competing theories all within a single analysis.

While standard statistical models assume independence, ERGMs assume and model dependence. In other words, ERGMs account for the fact that the presence or absence of ties is affected by the presence or absence of other ties.

ERGM Theory

  • networks are locally emergent; local configurations combine into larger structures
  • network ties self-organize, but they are also influenced by actor attributes and other exogenous factors
  • patterns in networks can be taken as evidence for processes that create and sustain the network
  • multiple processes can operate simultaneously
  • social networks are structured, yet stochastic
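As a pointer for future learning rather than something we will do in class, the ergm package (part of the statnet suite, and separate from igraph) is the standard tool for fitting these models. Its own documentation uses the Florentine marriage network, which connects back to the Padgett and Ansell example; a minimal sketch:

library(ergm)                                   # uses network objects, not igraph objects
data(florentine)                                # loads flomarriage, the Florentine marriage network
model <- ergm(flomarriage ~ edges + triangle)   # a density term plus a triadic closure term
summary(model)                                  # coefficients are conditional log-odds of a tie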