Commit d4c9a1b

feat(lec06): release
1 parent b839328 commit d4c9a1b

9 files changed (+124 -38 lines changed)

01-intro/programming-basics.qmd

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ When our notes reference other texts, we will provide the terminology/ideas that
 | Subtraction | `-` | `15 - 4` | `11` |
 | Multiplication | `*` | `-2 * 9` | `-18` |
 | Division | `/` | `15 / 2` | `7.5` |
-| Integer division | Cuts off remainder | `//` | `15 // 2` | `7` |
+| Integer division<br/>(Cuts off remainder) | `//` | `15 // 2` | `7` |
 | Remainder/Modulo | `%` | `19 % 3` | `1` <br/> (19 ÷ 3 = 6 Remainder 1) |
 | Exponentiation | `**` | `3 ** 2` | `9` |
 
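
For reference, the operators in the table above can be checked directly in Python. This is a small illustrative sketch and not part of the commit; the last line is an added observation that `//` rounds down (toward negative infinity), which differs from simply cutting off the remainder when an operand is negative.

```python
# Arithmetic operators from the table, evaluated in Python 3.
print(15 - 4)    # 11
print(-2 * 9)    # -18
print(15 / 2)    # 7.5  (true division always returns a float)
print(15 // 2)   # 7    (integer division)
print(19 % 3)    # 1    (remainder)
print(3 ** 2)    # 9

# Integer division and remainder fit together: a == (a // b) * b + (a % b)
quotient, remainder = divmod(19, 3)
print(quotient, remainder)   # 6 1

# Caveat: // floors the result, so a negative operand rounds down
print(-15 // 2)  # -8, not -7
```
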

06-research/cause-effect.qmd

Lines changed: 0 additions & 28 deletions
This file was deleted.
55.1 KB

06-variables-ii/index.qmd

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+---
+title: "Causality vs. EDA"
+subtitle: ""
+---
+
+## John Snow and the Broad Street Pump
+
+::: {.callout-note title="Read _Inferential Thinking_"}
+Read [all of Chapter 2](https://inferentialthinking.com/chapters/02/causality-and-experiments.html), which describes experimental setup and design in detail. It covers a story that is central to data science: John Snow and the Broad Street Pump.
+
+Before continuing, make sure that you are familiar with the following terminology:
+
+* observational study
+* causality
+* association
+* comparison
+* treatment group
+* control group
+* randomized controlled trial (RCT)
+
+:::
+
+## Randomized Controlled Trials vs. Observational Studies
+
+In the Broad Street Pump experiment, John Snow established a causal relationship (between what? Read _Inferential Thinking_ Chapter 2 to find out) because he noted that there was no systematic difference between the two groups he observed other than a single variable.
+
+Today, randomized controlled trials are an excellent way to compare two groups of otherwise similar individuals. However, for most of this class we will _not_ be able to conduct a randomized controlled trial, because the datasets we analyze almost all come from **observational studies**, not experiments. Moreover, these datasets are largely pre-existing materials collected by other researchers, and we may not know the full picture of how they collected the data. As a result, in this class we seek to understand associations between variables, and we will almost never seek to establish causal relationships between them.
+
+To further understand causality, we encourage you to take courses on inferential thinking such as Data 8 and Stat 20, as well as a wide range of Statistics courses.
+
+## Confounding
+
+From _Inferential Thinking_, [Ch 2.3 Establishing Causality](https://inferentialthinking.com/chapters/02/3/establishing-causality.html):
+
+> **In an observational study, if the treatment and control groups differ in ways other than the treatment, it is difficult to make conclusions about causality.**
+>
+> An underlying difference between the two groups (other than the treatment) is called a confounding factor, because it might confound you (that is, mess you up) when you try to reach a conclusion.
+
+**Confounding** occurs when two
+variables can be consistently
+associated with each other
+even when one does not cause
+the other.
+
+To determine whether a **confounding variable** can account for the association between two variables, we can try to **disaggregate** by different values of the confounding variable.
+
+In principle, this disaggregation could be repeated for a potentially unlimited number of confounding variables, so researchers generally don't attempt it exhaustively. Instead, we usually rely on assumptions drawn from social science theory or findings from prior studies to narrow the search for confounding variables that may influence the association between two variables.
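
As a rough illustration of disaggregation (not part of the commit), the sketch below uses made-up counts. The ice cream and sunburn scenario, the `weather` confounder, the helper `sunburn_rates`, and every number are assumptions chosen only to show the idea: an association that appears in the pooled data can disappear once we look within each value of the confounding variable.

```python
from collections import defaultdict

# Hypothetical counts: (weather, ate_ice_cream, people, sunburned)
counts = [
    ("hot", True, 80, 40),
    ("hot", False, 20, 10),
    ("cool", True, 20, 2),
    ("cool", False, 80, 8),
]

def sunburn_rates(rows):
    """Return {ate_ice_cream: share of people sunburned} for the given rows."""
    people = defaultdict(int)
    burned = defaultdict(int)
    for _weather, ate, n, n_burned in rows:
        people[ate] += n
        burned[ate] += n_burned
    return {ate: burned[ate] / people[ate] for ate in people}

# Pooled data: ice cream eaters look more likely to be sunburned (0.42 vs. 0.18).
overall = sunburn_rates(counts)

# Disaggregated by the suspected confounder: within each weather group,
# the sunburn rate is the same whether or not someone ate ice cream.
by_weather = {
    weather: sunburn_rates([row for row in counts if row[0] == weather])
    for weather in ("hot", "cool")
}

print(overall)
print(by_weather)
```

In this toy data, weather drives both ice cream eating and sunburn, so the pooled association says nothing about ice cream causing sunburn.
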
+
+
+## Exploratory Data Analysis
+
+If we're _not_ going to be studying causal relationships in this class, what _will_ we be looking at? In this course we will look deeply at a core component of Data Science: **Exploratory Data Analysis**, or **EDA**.
+
+Exploratory Data Analysis (EDA) is like detective work. In the words of the famous American statistician and mathematician [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) (we will discuss Tukey numbers soon):
+
+> Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.
+
+More formally, **Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data.
+
+## Data Wrangling
+
+A process very closely related to EDA is **data wrangling**, often called **data cleaning**. Data wrangling is the process of transforming raw data to facilitate subsequent analysis. It can address issues like unclear structure or formatting, missing or corrupted values, unit conversions, and so on.
+
+EDA and data cleaning are often thought of as an “infinite loop,” with each process driving the other.
+
+Fortunately, in our classes we will try our best to work with "clean" datasets. These datasets will often have already been preprocessed, allowing us to explore and ask questions much more easily than if we were stuck with messier data.
+
+## External Reading
+
+* (mentioned in notes) _Computational and Inferential Thinking_, [Ch 5.1](https://inferentialthinking.com/chapters/05/1/Arrays.html)
+* "Chapter 15: From Concepts to Models." Elizabeth Heger Boyle, Deborah Carr, Benjamin Cornwell, Shelley Correll, Robert Crosnoe, Jeremy Freese, and Mary C. Waters. 2017. _The Art and Science of Social Research_. New York: W. W. Norton & Company.

06-variables-ii/sample-population.qmd

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+---
+title: "Sample vs. Population"
+subtitle: ""
+---
+
+## Population vs. Sample
+
+Given a research question, the **population** is the group you want to learn something about. However, directly studying the population as a whole is often not possible! Data might not exist at that scale, or it might be too costly to collect, if it’s even possible to gather that information.
+
+Many times, we instead study a **sample** of the population. If the sample is a good representation of the population, we can make useful analyses at a much lower cost.
+
+### Sampling Frame
+
+The set of individuals we actually draw our sample from is the **sampling frame**. Depending on how we select our sample, we may miss individuals from the population we’re interested in, and we might also include individuals that are not in the population.
+
+## Examples
+
+:::{style="text-align: center"}
+![A sampling frame may include individuals not in our population.](images/sampling-frame.png){#fig-inflation fig-align=center width=80%}
+:::
+
+| Target Population | Collected sample |
+| --- | --- |
+| Student body of the school | A specific classroom of students at the school |
+| A bag of 100 marbles | 10 marbles from the bag |
+| Computing Education Research (CER) papers | Papers published at the American Society for Engineering Education (ASEE) conference |
+
+: Examples of target populations and the collected samples.
+
+In the last example of the table, some of the research papers published at the ASEE conference may not be specific to CER and may instead focus on education in other engineering fields, like mechanical engineering or civil engineering. Here the sampling frame is effectively the set of papers published at the ASEE conference, so the collected sample would need to be filtered down to just the CER papers we want.
+
+### A longer example
+
+Let’s say you’re planning a social event for all Data Science-declared sophomores (second-years). Since you only have the budget to cater pizza, you want to figure out what pizza toppings Data Science sophomores enjoy, and buy pizza toppings according to how popular they are.
+
+In order to figure out the most popular toppings, you survey every student walking into or out of Warren Hall (where Data Science course office hours are located) from 12PM to 1PM by asking what their favorite topping is. Assume that everyone responds.
+
+* **Population**: Data Science sophomores
+* **Sampling frame**: Students walking into/out of Warren Hall between 12PM and 1PM
+
+If we draw a sample from this sampling frame, we may not get a **representative** sample, because our respondents will include not just sophomores but also freshmen, juniors, seniors, non-Data Science majors, and generally many more students than our target population. These students may have different preferences than Data Science sophomores.
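
The mismatch between population and sampling frame in the Warren Hall example can be sketched with Python sets. This is an illustrative aside and not part of the commit; the rosters and every name in them are made up, and the variable names `covered`, `missed`, and `outside` are assumptions introduced here.

```python
# Hypothetical rosters; every name below is made up for illustration.
population = {"ana", "ben", "chen", "dee"}              # Data Science sophomores
sampling_frame = {"ben", "chen", "eli", "fay", "gus"}   # students passing Warren Hall, 12PM to 1PM

covered = population & sampling_frame    # in the population and reachable by the survey
missed = population - sampling_frame     # in the population but never surveyed
outside = sampling_frame - population    # surveyed but not in the population

print("covered:", sorted(covered))   # ['ben', 'chen']
print("missed:", sorted(missed))     # ['ana', 'dee']
print("outside:", sorted(outside))   # ['eli', 'fay', 'gus']

# Share of the sampling frame that belongs to the target population
print(len(covered) / len(sampling_frame))   # 0.4
```

A survey drawn from this frame would hear from `outside` students and never from `missed` ones, which is why the resulting sample may not be representative.
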
+
+## How do we construct representative samples?
+
+We will (hopefully) get into this topic later this semester.

_quarto.yml

Lines changed: 7 additions & 9 deletions
@@ -67,15 +67,13 @@ website:
 text: "Variables and Variable Types"
 - href: 05-variables/units-of-analysis.qmd
 text: "Units of Analysis"
-# - section: "L06: Research and Design"
-# href: 06-research/index.qmd
-# contents:
-# - href: 06-research/index.qmd
-# text: "The Scientific Method"
-# - href: 06-research/cause-effect.qmd
-# text: "Cause and Effect"
-# - href: 06-research/eda.qmd
-# text: "Exploratory Data Analysis"
+- section: "L06: Causality vs. EDA"
+href: 06-variables-ii/index.qmd
+contents:
+- href: 06-variables-ii/index.qmd
+text: "Causality vs. EDA"
+- href: 06-variables-ii/sample-population.qmd
+text: "Sample vs. Population"
 - href: five-things.qmd
 text: "5 Things to Know"
 - href: reference.qmd
