College Major vs Your Salary — Data Exploration with Pandas
The standard advice is "pick STEM." This analysis uses a WSJ/PayScale survey of 1.2 million Americans to stress-test that claim across 51 undergraduate majors. By engineering salary risk scores, growth trajectories, earnings ceilings, and rank-change metrics, it arrives at a more nuanced picture: STEM leads on starting salary, but HASS (humanities, arts, and social sciences) majors grow at nearly the same rate. Math more than doubles in salary by mid-career. Economics' top earners out-earn every engineering field. And the safest long-term bet by earnings predictability isn't engineering at all; it's Nursing. The analysis surfaces 8 concrete findings through a full EDA pipeline, correlation analysis, and 7 charts, all committed to the repo and visible without running code.
Overview
Problem
Salary data exists for anyone deciding what to study, but it's buried in flat files with no quick way to answer the questions that actually matter: which majors pay off fastest, which are the safest long-term bets, and how do broad degree categories really compare? Raw CSVs don't surface those patterns on their own. Without programmatic sorting, filtering, and aggregation, you end up eyeballing rows manually and missing the bigger picture. The real friction is turning a flat table into meaningful salary rankings, risk scores, and group comparisons.
Solution
The dataset was loaded into a pandas DataFrame and run through a structured EDA pipeline: shape inspection, null detection (.tail() and .isna() caught a silent NaN footer row), cleaning with .dropna(), then index-based lookups via .idxmax() / .idxmin() to surface salary extremes by major name. Feature engineering added these derived columns: Spread (P90 − P10, an earnings-risk proxy), Growth % ((Mid − Start) / Start × 100), Safety score (Start / Spread), Start and Mid ranks, Rank_Change, Salary_Band (pd.cut), and Group_Avg_Start (groupby transform). Majors were then ranked across four lenses, a pairwise Pearson correlation matrix was computed with upper-triangle masking, and seven publication-quality charts were generated (bar, grouped bar, scatter + regression, heatmap, boxplot, rank-change bar). Results are exported to CSV and JSON. A live stats DataFrame built from pre-computed variables collects all key findings in a single table that updates automatically on every run.
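The cleaning and feature-engineering steps can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the column names (Major, Group, Start, Mid, P10, P90), the toy salary figures, the band edges passed to pd.cut, and the sign convention for Rank_Change (positive means the major climbed) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up figures; the last row mimics a NaN footer.
# Column names are assumptions for illustration only.
raw = pd.DataFrame({
    "Major": ["A", "B", "C", "D", None],
    "Group": ["STEM", "STEM", "HASS", "Business", None],
    "Start": [60000.0, 45000.0, 35000.0, 50000.0, np.nan],
    "Mid": [100000.0, 90000.0, 70000.0, 75000.0, np.nan],
    "P10": [40000.0, 30000.0, 25000.0, 45000.0, np.nan],
    "P90": [140000.0, 135000.0, 120000.0, 95000.0, np.nan],
})

# Drop the NaN footer row before computing anything.
df = raw.dropna().reset_index(drop=True)

# Spread: P90 minus P10, used as an earnings-risk proxy.
df["Spread"] = df["P90"] - df["P10"]
# Growth %: relative increase from starting to mid-career salary.
df["Growth_Pct"] = (df["Mid"] - df["Start"]) / df["Start"] * 100
# Safety score: starting salary per unit of spread.
df["Safety"] = df["Start"] / df["Spread"]

# Rank 1 = highest salary; method="min" keeps ranks integral on ties.
df["Start_Rank"] = df["Start"].rank(ascending=False, method="min").astype(int)
df["Mid_Rank"] = df["Mid"].rank(ascending=False, method="min").astype(int)
# Assumed convention: positive Rank_Change means the major moved up.
df["Rank_Change"] = df["Start_Rank"] - df["Mid_Rank"]

# Hypothetical band edges; the real thresholds are not given in the text.
df["Salary_Band"] = pd.cut(
    df["Start"], bins=[0, 40000, 55000, np.inf], labels=["Low", "Mid", "High"]
)
# Broadcast each group's mean starting salary back onto its rows.
df["Group_Avg_Start"] = df.groupby("Group")["Start"].transform("mean")

# Index-based lookup: which major has the highest mid-career median?
top_mid_major = df.loc[df["Mid"].idxmax(), "Major"]
```

Using transform rather than agg keeps Group_Avg_Start aligned with the original rows, so each major can be compared against its own group's average without a merge.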
Challenges
The most significant challenge was data integrity, and it was analytical rather than technical. Early findings confidently claimed that HASS majors beat the average STEM starting salary, and that Philosophy was the biggest rank climber. Both were wrong. In this dataset Economics is categorised as Business, not HASS, and Nursing is also Business. Catching these errors required auditing every static claim against actual computed output and rebuilding the findings from scratch. The corrected version, with Journalism as the biggest climber and STEM edging HASS on growth (70.40% vs 68.86%), is a better story precisely because it came from verifying the data rather than assuming the obvious interpretation.
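One way to make this kind of audit repeatable is to compare the group each claim assumes against the group the data actually records. The sketch below is illustrative only: find_group_mismatches is a hypothetical helper, the Major and Group column names are assumed, and the Philosophy row's HASS label is an assumption (the text above confirms only that Economics and Nursing are labelled Business).

```python
import pandas as pd

def find_group_mismatches(df, assumed):
    """Return majors whose recorded Group differs from what a claim assumed.

    `assumed` maps major name -> the group a written claim relies on.
    Hypothetical helper for illustration; not part of pandas.
    """
    actual = df.set_index("Major")["Group"]
    rows = []
    for major, expected in assumed.items():
        found = actual.get(major)  # None if the major is not in the data
        if found != expected:
            rows.append({"Major": major, "Assumed": expected, "Actual": found})
    return pd.DataFrame(rows, columns=["Major", "Assumed", "Actual"])

# Group labels only; no salary figures are needed for this check.
groups = pd.DataFrame({
    "Major": ["Economics", "Nursing", "Philosophy"],
    "Group": ["Business", "Business", "HASS"],  # Philosophy label is assumed
})

# A claim that treats Economics as a HASS major gets flagged.
mismatches = find_group_mismatches(
    groups, {"Economics": "HASS", "Philosophy": "HASS"}
)
print(mismatches)
```

An empty result means every claim's grouping matches the data; any row returned points at a finding that needs to be recomputed before it is published.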
Results / Metrics
More than 60 pandas operations are demonstrated across inspection, cleaning, feature engineering, multi-condition filtering, aggregation, groupby, reshaping, apply/map, correlation analysis, styling, and export. Seven charts are committed to plots/ and visible without running code. Key findings: Chemical Engineering has the highest mid-career median at $107,000; Math is the biggest salary grower at 103.52%; Nursing has the lowest earnings risk with a $50,700 spread; Economics has the highest P90 ceiling at $210,000; starting salary predicts mid-career salary (Pearson r = 0.848); STEM out-earns HASS by about $27,844 at mid-career; HASS and STEM have nearly identical growth trajectories (68.86% vs 70.40%).
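For readers unfamiliar with the correlation step, here is a minimal sketch of a pairwise Pearson matrix with the upper triangle masked out. The toy numbers are invented so that Mid is exactly twice Start (giving r = 1 for that pair); they are not the survey's values, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up figures in thousands; Mid is exactly 2 x Start by construction.
toy = pd.DataFrame({
    "Start": [40.0, 50.0, 60.0, 70.0],
    "Mid": [80.0, 100.0, 120.0, 140.0],
    "P90": [150.0, 130.0, 200.0, 180.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr(method="pearson")

# Boolean mask that is True on the diagonal and above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Replace masked cells with NaN so each pair appears only once.
lower = corr.mask(mask)
print(lower.round(3))
```

Because the matrix is symmetric, the upper triangle repeats the lower one, and the diagonal is always 1. Masking both leaves one value per pair, which makes a heatmap easier to read. The same boolean mask could be passed to a plotting call that accepts one, such as seaborn's heatmap, though whether the project uses seaborn is not stated here.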