Project Milestone 1: Dataset Selection & Exploratory Data Analysis

I306 Statistics for Informatics

Overview

For your semester project, you will conduct a complete statistical analysis of a dataset of your choosing. This first milestone focuses on selecting an appropriate dataset and performing initial exploratory data analysis.

Due: End of Week 4
Points: 40

Dataset Requirements

Your dataset must meet the following criteria:

  • At least 500 observations (rows)
  • At least 5 variables
  • A mix of numeric and categorical variables
  • Publicly available (provide the source URL)
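
The criteria above can be checked quickly before you commit to a dataset. The following is a minimal sketch in Python using only the standard library. The toy CSV, the column names, and the helper names `is_numeric` and `check_requirements` are all made up for illustration; substitute your own file. Note that a column counts as numeric here only because every value parses as a number, so a numerically coded category (for example, a year or an ID) still needs your judgment.

```python
import csv
import io

# Made-up toy data standing in for a real file (assumption for illustration).
# For your own data, read from open("your_file.csv", newline="") instead.
SAMPLE_CSV = """major,class_year,gpa,credits,commute_min
Informatics,Junior,3.4,15,12
Biology,Senior,3.1,12,45
Design,Sophomore,3.8,16,20
"""


def is_numeric(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def check_requirements(rows, min_rows=500, min_vars=5):
    """Summarize whether a list of dict rows meets the milestone criteria."""
    columns = list(rows[0].keys()) if rows else []
    # A column is treated as numeric if every non-empty value parses as a number.
    numeric = [c for c in columns
               if all(is_numeric(r[c]) for r in rows if r[c] != "")]
    categorical = [c for c in columns if c not in numeric]
    return {
        "n_rows": len(rows),
        "n_vars": len(columns),
        "enough_rows": len(rows) >= min_rows,
        "enough_vars": len(columns) >= min_vars,
        "has_mix": bool(numeric) and bool(categorical),
        "numeric": numeric,
        "categorical": categorical,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = check_requirements(rows)
for key, value in report.items():
    print(f"{key}: {value}")
```

With the three-row toy data, the row-count check fails, as it should; a real dataset needs at least 500 rows.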

Suggested Data Sources

Deliverables

Submit a Quarto document (.qmd) and its rendered output (PDF or HTML) containing:

1. Dataset Description (10 points)

  • What is your dataset about?
  • Where did it come from? Include a URL.
  • How was the data collected?
  • What are the observational units (what does each row represent)?

2. Variable Descriptions (10 points)

Create a table listing each variable:

Variable Name    Type (Numeric/Categorical)    Description

Identify which variables you plan to use as:

  • Response variable(s)
  • Explanatory variable(s)

Note: If your dataset has more than about 10 variables, list only the 10 you find most interesting.

3. Summary Statistics (10 points)

For numeric variables:

  • Mean, median, standard deviation, min, max
  • Identify any obvious outliers (extreme values that don’t seem to fit the distribution of the rest of the data).
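
One way to compute these statistics and flag candidate outliers is sketched below in Python, using only the standard library. The values in `commute_min` are invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging extreme values, not a requirement of this milestone; the helper names `summarize` and `iqr_outliers` are hypothetical.

```python
import statistics

# Invented values for illustration only; 120 is planted as an extreme value.
commute_min = [12, 15, 18, 20, 22, 25, 25, 30, 35, 120]


def summarize(values):
    """Return the five summary statistics requested for a numeric variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


summary = summarize(commute_min)
outliers = iqr_outliers(commute_min)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print("flagged as possible outliers:", outliers)
```

A flagged value is only a candidate. Whether it is a data-entry error or a genuine extreme observation is something to discuss in your write-up.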

For categorical variables:

  • Frequency counts
  • Proportions
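
Frequency counts and proportions for a categorical variable can be sketched as follows, again in standard-library Python. The `majors` list is invented for illustration and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Invented values for illustration only.
majors = ["Informatics", "Biology", "Informatics", "Design",
          "Informatics", "Biology", "Design", "Informatics"]


def frequency_table(values):
    """Map each category to (count, proportion), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: (n, n / total) for level, n in counts.most_common()}


freq = frequency_table(majors)
for level, (count, proportion) in freq.items():
    print(f"{level:<12} {count:>3} {proportion:>6.2f}")
```

The proportions should sum to 1, which is a quick sanity check on the table.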

4. Contingency Table (10 points)

Create at least one contingency table showing the relationship between two categorical variables. If your dataset doesn’t have two suitable categorical variables, you may bin a numeric variable into categories.
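
The binning approach described above can be sketched as follows in standard-library Python. The records, the 30-minute cutoff, and the helper names `bin_commute` and `contingency_table` are all assumptions made for illustration; choose bins that make sense for your own variable and explain your choice.

```python
from collections import Counter

# Invented (major, commute in minutes) records for illustration only.
records = [("Informatics", 12), ("Informatics", 45), ("Biology", 20),
           ("Biology", 50), ("Design", 15), ("Informatics", 10),
           ("Design", 40), ("Biology", 25)]


def bin_commute(minutes, cutoff=30):
    """Bin a numeric commute time into two categories (cutoff is arbitrary)."""
    return "short" if minutes < cutoff else "long"


def contingency_table(pairs):
    """Count how often each (row level, column level) pair occurs."""
    counts = Counter(pairs)
    row_levels = sorted({r for r, _ in pairs})
    col_levels = sorted({c for _, c in pairs})
    table = {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}
    return table, col_levels


pairs = [(major, bin_commute(minutes)) for major, minutes in records]
table, col_levels = contingency_table(pairs)

# Print the table with row totals.
print(f"{'':<12}" + "".join(f"{c:>7}" for c in col_levels) + f"{'total':>7}")
for row, cells in table.items():
    line = "".join(f"{cells[c]:>7}" for c in col_levels)
    print(f"{row:<12}{line}{sum(cells.values()):>7}")
```

If you work in pandas instead, `pandas.cut` and `pandas.crosstab` cover the same two steps.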

Submission

Submit your .qmd source file and rendered output (PDF or HTML) to Canvas by the due date.

Grading Rubric

Component                Points    Criteria
Dataset Description      10        Complete, accurate description with source
Variable Descriptions    10        All variables documented with types
Summary Statistics       10        Appropriate statistics computed and interpreted
Contingency Table        10        Correctly constructed with meaningful interpretation

Tips

  • Choose a dataset you find genuinely interesting—you’ll be working with it all semester
  • Make sure your dataset is rich enough to support the analyses we’ll cover: visualization, hypothesis testing, confidence intervals, and regression
  • If you’re unsure whether your dataset is appropriate, ask during office hours