Introduction to Data Science
A notebook covering the data science lifecycle, statistical data types, EDA, data cleaning, visualization, and prediction fundamentals.
"The application of data-centric computational and inferential thinking to understand the world and solve problems." – Joseph Gonzalez
🔥 What is Data Science? Roles Compared
| | Data Scientists | Data Engineers |
|---|---|---|
| Focus | Optimize data processing | Optimize data flow |
| Tasks | Define metrics, establish collection methods, work with enterprise systems | Pipeline design, data infrastructure |

| | Data Scientists | Statisticians |
|---|---|---|
| Focus | Collection, cleaning, ML models | Surveys, polls, experiments |
| Methods | Machine learning at scale | Improve upon simple statistical models |

| | Data Scientists | Business Analysts |
|---|---|---|
| Focus | Automate reports, data extraction | Database design, ROI assessment |
| Output | ML models, pipelines | Finance planning, risk management |
🔁 The Data Science Life Cycle
```mermaid
flowchart LR
    Q["❓ Define\nQuestion/Problem"]:::step --> A["📥 Acquire &\nClean Data"]:::step
    A --> E["🔍 Exploratory\nData Analysis"]:::step
    E --> P["🎯 Predict / Infer\nConclusions"]:::step
    P --> Q
    classDef step fill:#4A90D9,stroke:#2c5f8a,color:#fff
```
| Step | Description |
|---|---|
| 1 | Determine a question or problem |
| 2 | Acquire and clean relevant data |
| 3 | Conduct exploratory data analysis |
| 4 | Predict / infer conclusions from the data |
🐍 Python vs R
| Python | R |
|---|---|
| Data analysis integrated in web applications | Data analysis using standalone computing on individual servers |
| Statistics code in production databases | Strong statistical and academic community |
- Python Package Index (PyPI): a repository of Python software consisting of various libraries
- Comprehensive R Archive Network (CRAN): A repository of R packages
For most ML and deep learning work, Python is the default choice: the ecosystem (NumPy, Pandas, PyTorch, Scikit-learn) is unmatched.
📋 Tabular Data
Tabular data is arranged in rows and columns. The most common file format is CSV (Comma-Separated Values), where each record is a line in the file and fields are separated by commas.
```python
import pandas as pd

# Load the CSV into a DataFrame and preview the first five rows.
df = pd.read_csv('data.csv')
df.head()
```
📈 Statistical Data Types
| Type | Description | Examples |
|---|---|---|
| Nominal (No Order) | Categories with no inherent ranking | Gender, Language, ID numbers |
| Ordinal (Ordered Categories) | Clear ranking, but unequal intervals | Clothing size, Education level, Star ratings |
| Numerical (Quantitative) | Measurable quantities | Salary, Height, Weight, Price |
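The three types map directly onto pandas dtypes. A small sketch with an invented survey table, where the ordinal column is declared as an ordered categorical so pandas can rank it correctly:

```python
import pandas as pd

# Hypothetical survey data illustrating the three statistical types.
df = pd.DataFrame({
    "language": ["EN", "FR", "EN"],          # nominal: no inherent order
    "shirt_size": ["S", "L", "M"],           # ordinal: ordered categories
    "height_cm": [170.0, 182.5, 165.0],      # numerical: measurable quantity
})

# Declaring the ordinal column as an ordered categorical lets pandas
# compare and sort sizes correctly (S < M < L).
df["shirt_size"] = pd.Categorical(
    df["shirt_size"], categories=["S", "M", "L"], ordered=True
)

print(df["shirt_size"].max())  # L
```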
🔍 Exploratory Data Analysis (EDA)
An approach to analyzing datasets to summarize their main characteristics. The goal of EDA is to deeply understand the data you have. You visualize and transform data to pick up patterns, issues, and interesting signals.
Key Properties to Examine
1️⃣ Granularity
What does each record represent? How fine or coarse is the data? This determines what type of analysis is possible.
2️⃣ Scope
What does the dataset cover? Does it include the population, geography, platform, or timeframe you're interested in?
3️⃣ Temporality
How is the data situated in time? When was it collected? What time fields exist? What date formats are used?
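One practical temporality check is parsing date fields with an explicit format, so ambiguous dates are not silently misread. A sketch with two made-up timestamps in different formats:

```python
import pandas as pd

# Hypothetical timestamps in two formats, as often found when data
# is merged from several sources.
raw = pd.Series(["2024-02-01", "01/02/2024"])

# Parse each format explicitly so day-first vs month-first dates
# are not silently misinterpreted.
iso = pd.to_datetime(raw.iloc[[0]], format="%Y-%m-%d")
dmy = pd.to_datetime(raw.iloc[[1]], format="%d/%m/%Y")

print(iso.iloc[0], dmy.iloc[0])  # both represent 2024-02-01
```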
4️⃣ Faithfulness
Does the dataset accurately reflect reality?
Signs of unfaithful data:
- Unrealistic values (e.g., negative quantities)
- Future dates for past events
- Non-existent locations
- Extreme outliers
- Inconsistent columns (e.g., age contradicts birthday)
- Misspellings or shifted columns
- Fake or duplicated entries
Data faithfulness checks should happen before any analysis. A model trained on corrupted data will produce confidently wrong results.
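Several of these red flags can be screened for mechanically. A minimal sketch, using an invented orders table that contains a negative quantity and a duplicated row:

```python
import pandas as pd

# Hypothetical orders table with two of the red flags listed above:
# an unrealistic negative quantity and a duplicated entry.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "quantity": [2, -1, -1, 5],
})

bad_quantity = orders[orders["quantity"] < 0]   # unrealistic values
duplicates = orders[orders.duplicated()]        # duplicated entries

print(len(bad_quantity), len(duplicates))  # 2 1
```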
🧹 Data Cleaning
Detecting and fixing corrupt or inaccurate records from a record set.
| Issue | Description | Fix |
|---|---|---|
| Missing values | Empty fields (blank age, missing salary) | Fill with mean/median, remove rows, or flag |
| Formatting | Inconsistent formats (01/02/24 vs 2024-02-01, Male vs male) | Standardize formats |
| Structure | Multiple values in one column, unclear column names | Normalize to rows = observations, columns = variables |
| Complex values | Combined info ("180cm / 75kg" in one cell), nested/encoded data | Parse or separate into components |
| Unit conversion | Mixed units (kg vs pounds, USD vs SGD) | Convert to consistent unit system |
| Magnitude interpretation | Outliers (age = 999), different scales (0–1 vs 0–100) | Clip, normalize, or investigate outliers |
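Several of these fixes applied in one pass, on a small invented table (the column names and the parsing regex are assumptions for illustration):

```python
import pandas as pd

# Hypothetical messy records combining issues from the table above.
df = pd.DataFrame({
    "gender": ["Male", "male", None],
    "body": ["180cm / 75kg", "165cm / 60kg", "172cm / 68kg"],
})

# Formatting: standardize inconsistent category labels.
df["gender"] = df["gender"].str.lower()

# Missing values: flag rather than silently drop.
df["gender"] = df["gender"].fillna("unknown")

# Complex values: split the combined column into two numeric ones.
parts = df["body"].str.extract(r"(\d+)cm / (\d+)kg").astype(int)
df[["height_cm", "weight_kg"]] = parts

print(df[["gender", "height_cm", "weight_kg"]])
```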
📊 Data Visualization
Two useful visualization tools are Python's Matplotlib and Seaborn libraries:
- Matplotlib – create two-dimensional plots of your data
- Seaborn – built on Matplotlib; supports multidimensional plots and more advanced visualizations
Visualizing Qualitative Data (Nominal/Ordinal)
- Visualize counts using `countplot()`
- Visualize averages using `barplot()`
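A sketch of both calls on invented sales data (the column names and values are made up):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sales data: an ordinal category plus a numeric price.
df = pd.DataFrame({
    "size": ["S", "M", "M", "L", "L", "L"],
    "price": [10, 15, 14, 20, 22, 21],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(x="size", data=df, ax=ax1)           # counts per category
sns.barplot(x="size", y="price", data=df, ax=ax2)  # mean price per category
```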
Visualizing Quantitative Data
- Histogram – shows the distribution of one numerical variable
- Scatter plot – shows the relationship between two numerical variables
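Both plot types in one sketch, using synthetic heights and weights generated with NumPy purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: heights drawn from a normal distribution, and
# weights loosely correlated with height.
rng = np.random.default_rng(0)
heights = rng.normal(170, 10, 200)
weights = 0.5 * heights + rng.normal(0, 5, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=20)           # distribution of one numerical variable
ax2.scatter(heights, weights, s=8)   # relationship between two variables
```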
🎯 Prediction in Data Science
Two Main Types of Prediction Tasks
A. Classification (Predicts categories)
- Used to predict a categorical variable (a category or label)
- Example: Predicting whether an image is a dog or a cat
- Common method: K-Nearest Neighbors (KNN) – classifies a new point based on the labels of its closest labeled neighbors
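A minimal KNN sketch using scikit-learn's `KNeighborsClassifier`; the toy points and labels are invented:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D feature points with made-up "cat"/"dog" labels.
X = [[1, 1], [1, 2], [8, 8], [9, 8]]
y = ["cat", "cat", "dog", "dog"]

# Classify a new point by majority vote among its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 1]]))  # nearest neighbors are mostly "cat"
```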
B. Regression (Predicts numbers)
- Used to predict a continuous variable (a number)
- Example: Predicting the price of a car
- Common method: Linear Regression – models a linear relationship between variables and uses that relationship to make predictions
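A minimal linear regression sketch with scikit-learn; the car ages and prices below are invented and exactly linear, so the extrapolation is easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: car age in years vs price, following price = 22000 - 2000*age.
age = np.array([[1], [2], [3], [4]])
price = np.array([20000, 18000, 16000, 14000])

# Fit the line, then predict the price of a 5-year-old car.
model = LinearRegression().fit(age, price)
pred = model.predict([[5]])
print(round(pred[0]))  # 12000
```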
Classification outputs labels; Regression outputs numbers. Choosing the right type is the first decision in any prediction task.
Part of my data science foundations series. Next: statistical testing, correlation, and feature engineering.