Introduction to Data Science
A notebook covering the data science lifecycle, statistical data types, EDA, data cleaning, visualization, and prediction fundamentals.
"The application of data-centric computational and inferential thinking to understand the world and solve problems." – Joseph Gonzalez
🔥 What is Data Science? Roles Compared
| | Data Scientists | Data Engineers |
|---|---|---|
| Focus | Optimize data processing | Optimize data flow |
| Tasks | Define metrics, establish collection methods, work with enterprise systems | Pipeline design, data infrastructure |

| | Data Scientists | Statisticians |
|---|---|---|
| Focus | Collection, cleaning, ML models | Surveys, polls, experiments |
| Methods | Machine learning at scale | Improve upon simple statistical models |

| | Data Scientists | Business Analysts |
|---|---|---|
| Focus | Automate reports, data extraction | Database design, ROI assessment |
| Output | ML models, pipelines | Finance planning, risk management |
🔁 The Data Science Life Cycle
```mermaid
flowchart LR
    Q["❓ Define\nQuestion/Problem"]:::step --> A["📥 Acquire &\nClean Data"]:::step
    A --> E["🔍 Exploratory\nData Analysis"]:::step
    E --> P["🎯 Predict / Infer\nConclusions"]:::step
    P --> Q
    classDef step fill:#4A90D9,stroke:#2c5f8a,color:#fff
```
| Step | Description |
|---|---|
| 1 | Determine a question or problem |
| 2 | Acquire and clean relevant data |
| 3 | Conduct exploratory data analysis |
| 4 | Predict / infer conclusions from the data |
🐍 Python vs R
| Python | R |
|---|---|
| Data analysis integrated in web applications | Data analysis using standalone computing on individual servers |
| Statistics code in production databases | Strong statistical and academic community |
- Python Package Index (PyPI): a repository of Python software consisting of various libraries
- Comprehensive R Archive Network (CRAN): A repository of R packages
For most ML and deep learning work, Python is the default choice: the ecosystem (NumPy, Pandas, PyTorch, Scikit-learn) is unmatched.
📋 Tabular Data
Tabular data is arranged in rows and columns. The most common file format is CSV (Comma-Separated Values), where each record is a line in the file and fields are separated by commas.
```python
import pandas as pd

# Load the CSV into a DataFrame and preview the first five rows.
df = pd.read_csv('data.csv')
df.head()
```
📈 Statistical Data Types
| Type | Description | Examples |
|---|---|---|
| Nominal (No Order) | Categories with no inherent ranking | Gender, Language, ID numbers |
| Ordinal (Ordered Categories) | Clear ranking, but unequal intervals | Clothing size, Education level, Star ratings |
| Numerical (Quantitative) | Measurable quantities | Salary, Height, Weight, Price |
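The three types map directly onto pandas dtypes. A small sketch with an invented survey table, where the ordinal column is declared as an ordered categorical so pandas can rank it correctly:

```python
import pandas as pd

# Hypothetical survey data illustrating the three statistical types.
df = pd.DataFrame({
    "language": ["EN", "FR", "EN"],          # nominal: no inherent order
    "shirt_size": ["S", "L", "M"],           # ordinal: ordered categories
    "height_cm": [170.0, 182.5, 165.0],      # numerical: measurable quantity
})

# Declaring the ordinal column as an ordered categorical lets pandas
# compare and sort sizes correctly (S < M < L).
df["shirt_size"] = pd.Categorical(
    df["shirt_size"], categories=["S", "M", "L"], ordered=True
)

print(df["shirt_size"].max())  # L
```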
🔍 Exploratory Data Analysis (EDA)
An approach to analyzing datasets to summarize their main characteristics. The goal of EDA is to deeply understand the data you have. You visualize and transform data to pick up patterns, issues, and interesting signals.
Key Properties to Examine
1️⃣ Granularity
What does each record represent? How fine or coarse is the data? This determines what type of analysis is possible.
2️⃣ Scope
What does the dataset cover? Does it include the population, geography, platform, or timeframe you're interested in?
3️⃣ Temporality
How is the data situated in time? When was it collected? What time fields exist? What date formats are used?
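One practical temporality check is parsing date fields with an explicit format, so ambiguous dates are not silently misread. A sketch with two made-up timestamps in different formats:

```python
import pandas as pd

# Hypothetical timestamps in two formats, as often found when data
# is merged from several sources.
raw = pd.Series(["2024-02-01", "01/02/2024"])

# Parse each format explicitly so day-first vs month-first dates
# are not silently misinterpreted.
iso = pd.to_datetime(raw.iloc[[0]], format="%Y-%m-%d")
dmy = pd.to_datetime(raw.iloc[[1]], format="%d/%m/%Y")

print(iso.iloc[0], dmy.iloc[0])  # both represent 2024-02-01
```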
4️⃣ Faithfulness
Does the dataset accurately reflect reality?
Signs of unfaithful data:
- Unrealistic values (e.g., negative quantities)
- Future dates for past events
- Non-existent locations
- Extreme outliers
- Inconsistent columns (e.g., age contradicts birthday)
- Misspellings or shifted columns
- Fake or duplicated entries
Data faithfulness checks should happen before any analysis. A model trained on corrupted data will produce confidently wrong results.
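Several of these red flags can be screened for mechanically. A minimal sketch, using an invented orders table that contains a negative quantity and a duplicated row:

```python
import pandas as pd

# Hypothetical orders table with two of the red flags listed above:
# an unrealistic negative quantity and a duplicated entry.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "quantity": [2, -1, -1, 5],
})

bad_quantity = orders[orders["quantity"] < 0]   # unrealistic values
duplicates = orders[orders.duplicated()]        # duplicated entries

print(len(bad_quantity), len(duplicates))  # 2 1
```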
🧹 Data Cleaning
Detecting and fixing corrupt or inaccurate records from a record set.
| Issue | Description | Fix |
|---|---|---|
| Missing values | Empty fields (blank age, missing salary) | Fill with mean/median, remove rows, or flag |
| Formatting | Inconsistent formats (01/02/24 vs 2024-02-01, Male vs male) | Standardize formats |
| Structure | Multiple values in one column, unclear column names | Normalize to rows = observations, columns = variables |
| Complex values | Combined info ("180cm / 75kg" in one cell), nested/encoded data | Parse or separate into components |
| Unit conversion | Mixed units (kg vs pounds, USD vs SGD) | Convert to consistent unit system |
| Magnitude interpretation | Outliers (age = 999), different scales (0–1 vs 0–100) | Clip, normalize, or investigate outliers |
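Several of these fixes applied in one pass, on a small invented table (the column names and the parsing regex are assumptions for illustration):

```python
import pandas as pd

# Hypothetical messy records combining issues from the table above.
df = pd.DataFrame({
    "gender": ["Male", "male", None],
    "body": ["180cm / 75kg", "165cm / 60kg", "172cm / 68kg"],
})

# Formatting: standardize inconsistent category labels.
df["gender"] = df["gender"].str.lower()

# Missing values: flag rather than silently drop.
df["gender"] = df["gender"].fillna("unknown")

# Complex values: split the combined column into two numeric ones.
parts = df["body"].str.extract(r"(\d+)cm / (\d+)kg").astype(int)
df[["height_cm", "weight_kg"]] = parts

print(df[["gender", "height_cm", "weight_kg"]])
```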
📊 Data Visualization
Two useful visualization tools are Python's Matplotlib and Seaborn libraries:
- Matplotlib – create two-dimensional plots of your data
- Seaborn – built on Matplotlib; supports multidimensional plots and more advanced visualizations
Visualizing Qualitative Data (Nominal/Ordinal)
- Visualize counts using `countplot()`
- Visualize averages using `barplot()`
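A sketch of both calls on invented sales data (the column names and values are made up):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sales data: an ordinal category plus a numeric price.
df = pd.DataFrame({
    "size": ["S", "M", "M", "L", "L", "L"],
    "price": [10, 15, 14, 20, 22, 21],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(x="size", data=df, ax=ax1)           # counts per category
sns.barplot(x="size", y="price", data=df, ax=ax2)  # mean price per category
```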
Visualizing Quantitative Data
- Histogram – shows the distribution of one numerical variable
- Scatter plot – shows the relationship between two numerical variables
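Both plot types in one sketch, using synthetic heights and weights generated with NumPy purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: heights drawn from a normal distribution, and
# weights loosely correlated with height.
rng = np.random.default_rng(0)
heights = rng.normal(170, 10, 200)
weights = 0.5 * heights + rng.normal(0, 5, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=20)           # distribution of one numerical variable
ax2.scatter(heights, weights, s=8)   # relationship between two variables
```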
🎯 Prediction in Data Science
Two Main Types of Prediction Tasks
A. Classification (Predicts categories)
- Used to predict a categorical variable (a category or label)
- Example: Predicting whether an image is a dog or a cat
- Common method: K-Nearest Neighbors (KNN) – classifies a new point based on the labels of its closest labeled neighbors
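A minimal KNN sketch using scikit-learn's `KNeighborsClassifier`; the toy points and labels are invented:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D feature points with made-up "cat"/"dog" labels.
X = [[1, 1], [1, 2], [8, 8], [9, 8]]
y = ["cat", "cat", "dog", "dog"]

# Classify a new point by majority vote among its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 1]]))  # nearest neighbors are mostly "cat"
```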
B. Regression (Predicts numbers)
- Used to predict a continuous variable (a number)
- Example: Predicting the price of a car
- Common method: Linear Regression – models a linear relationship between variables and uses that relationship to make predictions
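A minimal linear regression sketch with scikit-learn; the car ages and prices below are invented and exactly linear, so the extrapolation is easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: car age in years vs price, following price = 22000 - 2000*age.
age = np.array([[1], [2], [3], [4]])
price = np.array([20000, 18000, 16000, 14000])

# Fit the line, then predict the price of a 5-year-old car.
model = LinearRegression().fit(age, price)
pred = model.predict([[5]])
print(round(pred[0]))  # 12000
```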
Classification outputs labels; Regression outputs numbers. Choosing the right type is the first decision in any prediction task.
Part of my data science foundations series. Next: statistical testing, correlation, and feature engineering.