Introduction to Data Science

A notebook covering the data science lifecycle, statistical data types, EDA, data cleaning, visualization, and prediction fundamentals.

“The application of data-centric computational and inferential thinking to understand the world and solve problems.” (Joseph Gonzalez)


👥 What is Data Science? Roles Compared

|  | Data Scientists | Data Engineers |
| --- | --- | --- |
| Focus | Optimize data processing | Optimize data flow |
| Tasks | Define metrics, establish collection methods, work with enterprise systems | Pipeline design, data infrastructure |

|  | Data Scientists | Statisticians |
| --- | --- | --- |
| Focus | Collection, cleaning, ML models | Surveys, polls, experiments |
| Methods | Machine learning at scale | Improve upon simple statistical models |

|  | Data Scientists | Business Analysts |
| --- | --- | --- |
| Focus | Automate reports, data extraction | Database design, ROI assessment |
| Output | ML models, pipelines | Finance planning, risk management |

🔄 The Data Science Life Cycle

flowchart LR
    Q["❓ Define\nQuestion/Problem"]:::step --> A["📥 Acquire &\nClean Data"]:::step
    A --> E["🔍 Exploratory\nData Analysis"]:::step
    E --> P["🎯 Predict / Infer\nConclusions"]:::step
    P --> Q

    classDef step fill:#4A90D9,stroke:#2c5f8a,color:#fff

| Step | Description |
| --- | --- |
| 1 | Determine a question or problem |
| 2 | Acquire and clean relevant data |
| 3 | Conduct exploratory data analysis |
| 4 | Predict / infer conclusions from the data |

🐍 Python vs R

| Python | R |
| --- | --- |
| Data analysis integrated in web applications | Data analysis using standalone computing on individual servers |
| Statistics code in production databases | Strong statistical and academic community |

  • Python Package Index (PyPI): A Python software repository consisting of various libraries
  • Comprehensive R Archive Network (CRAN): A repository of R packages

For most ML and deep learning work, Python is the default choice: its ecosystem (NumPy, pandas, PyTorch, scikit-learn) covers the workflow from data wrangling to model training.


📊 Tabular Data

Tabular data is arranged in rows and columns. The most common file format is CSV (Comma-Separated Values), where each record is a line in the file and fields are separated by commas.

import pandas as pd

# Load the CSV file into a DataFrame, then preview the first five rows
df = pd.read_csv('data.csv')
df.head()

Statistical Data Types

| Type | Description | Examples |
| --- | --- | --- |
| Nominal (No Order) | Categories with no inherent ranking | Gender, Language, ID numbers |
| Ordinal (Ordered Categories) | Clear ranking, but unequal intervals | Clothing size, Education level, Star ratings |
| Numerical (Quantitative) | Measurable quantities | Salary, Height, Weight, Price |
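
As a rough sketch of how the three types in the table might be represented in pandas: nominal data as an unordered categorical, ordinal data as an ordered categorical, and numerical data as a float column. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, one column per statistical type
df = pd.DataFrame({
    "language": ["English", "Malay", "English", "Tamil"],   # nominal
    "shirt_size": ["M", "S", "L", "M"],                      # ordinal
    "salary": [4200.0, 3100.0, 5800.0, 3900.0],              # numerical
})

# Nominal: categories without any ranking
df["language"] = pd.Categorical(df["language"])

# Ordinal: categories with an explicit order, so comparisons make sense
df["shirt_size"] = pd.Categorical(
    df["shirt_size"], categories=["S", "M", "L"], ordered=True
)

largest = df["shirt_size"].max()                 # ranking is meaningful
at_least_m = int((df["shirt_size"] >= "M").sum())

# Numerical: arithmetic such as the mean is meaningful
mean_salary = df["salary"].mean()
```

Comparisons such as `>=` only work on the ordered column; asking for the maximum of the unordered `language` column would raise an error, which mirrors the "no inherent ranking" row in the table.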

🔍 Exploratory Data Analysis (EDA)

EDA is an approach to analyzing datasets in order to summarize their main characteristics. The goal is to understand the data you have in depth: you visualize and transform it to surface patterns, issues, and interesting signals.

Key Properties to Examine

1️⃣ Granularity

What does each record represent? How fine or coarse is the data? This determines what type of analysis is possible.
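
One way to check granularity, sketched here on a hypothetical `orders` table, is to test whether a candidate key is unique and, if not, aggregate to the coarser level you need.

```python
import pandas as pd

# Hypothetical data: each record is one item within an order
orders = pd.DataFrame({
    "order_id": [101, 101, 102, 103, 103, 103],
    "item": ["pen", "ink", "pad", "pen", "pad", "clip"],
    "price": [2.0, 5.0, 3.5, 2.0, 3.5, 1.0],
})

# If order_id repeats, a record is finer-grained than "one order"
one_row_per_order = orders["order_id"].is_unique

# Roll up to the coarser granularity: one row per order
per_order = orders.groupby("order_id", as_index=False).agg(
    n_items=("item", "count"),
    total=("price", "sum"),
)
```

Here `one_row_per_order` is false, so any per-order analysis should run on `per_order` rather than on the raw rows.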

2️⃣ Scope

What does the dataset cover? Does it include the population, geography, platform, or timeframe you’re interested in?
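
A quick scope check could look like the sketch below, which lists what a hypothetical `events` log actually covers before you draw conclusions from it. The columns and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical usage log
events = pd.DataFrame({
    "country": ["SG", "SG", "MY", "SG"],
    "platform": ["ios", "web", "web", "ios"],
    "date": pd.to_datetime(
        ["2024-01-03", "2024-01-17", "2024-02-02", "2024-02-20"]
    ),
})

# Which geographies and platforms appear, and over what timeframe?
countries = sorted(events["country"].unique())
platforms = sorted(events["platform"].unique())
date_range = (events["date"].min(), events["date"].max())

# A question about Android users is out of scope for this dataset
has_android = "android" in platforms
```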

3️⃣ Temporality

How is the data situated in time? When was it collected? What time fields exist? What date formats are used?
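
The questions above can be explored with `pd.to_datetime`. This is a minimal sketch that assumes the dates are stored as ISO-style strings in a hypothetical `collected` column; a different source format would need a different `format` string.

```python
import pandas as pd

# Hypothetical raw column with one unparseable entry
raw = pd.DataFrame(
    {"collected": ["2024-02-01", "2024-02-15", "not recorded", "2024-03-09"]}
)

# Parse with an explicit format; values that do not match become NaT
raw["collected_at"] = pd.to_datetime(
    raw["collected"], format="%Y-%m-%d", errors="coerce"
)

n_unparsed = int(raw["collected_at"].isna().sum())

# Time span covered by the parseable records
span = raw["collected_at"].max() - raw["collected_at"].min()

# Records per calendar month (NaT entries are dropped by value_counts)
per_month = raw["collected_at"].dt.to_period("M").value_counts().sort_index()
```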

4️⃣ Faithfulness

Does the dataset accurately reflect reality?

Signs of unfaithful data:

  • Unrealistic values (e.g., negative quantities)
  • Future dates for past events
  • Non-existent locations
  • Extreme outliers
  • Inconsistent columns (e.g., age contradicts birthday)
  • Misspellings or shifted columns
  • Fake or duplicated entries

Data faithfulness checks should happen before any analysis. A model trained on corrupted data will produce confidently wrong results.
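
Several of the signs listed above can be turned into simple boolean flags. The sketch below uses a hypothetical `people` table; the plausible age range of 0 to 120 and the one-year tolerance are assumptions, and the reference date is fixed so the result is reproducible.

```python
import pandas as pd

today = pd.Timestamp("2024-06-01")  # fixed reference date (assumption)

# Hypothetical records containing a duplicate, a negative age,
# and a birthday in the future
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cho", "Dee"],
    "age": [34, 29, 29, -5, 40],
    "birthday": pd.to_datetime(
        ["1990-03-01", "1995-01-10", "1995-01-10", "2000-07-07", "2030-01-01"]
    ),
})

flags = pd.DataFrame({
    "unrealistic_age": ~people["age"].between(0, 120),
    "future_birthday": people["birthday"] > today,
    "duplicate_row": people.duplicated(keep="first"),
})

# Inconsistent columns: approximate age implied by birthday vs stated age
implied_age = (today - people["birthday"]).dt.days // 365
flags["age_mismatch"] = (implied_age - people["age"]).abs() > 1

# Rows that trip at least one check deserve a closer look
suspect = people[flags.any(axis=1)]
```

Flagging rather than deleting keeps the decision about each suspicious row explicit.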


🧹 Data Cleaning

Data cleaning is the process of detecting and fixing corrupt or inaccurate records in a dataset.

| Issue | Description | Fix |
| --- | --- | --- |
| Missing values | Empty fields (blank age, missing salary) | Fill with mean/median, remove rows, or flag |
| Formatting | Inconsistent formats (01/02/24 vs 2024-02-01, Male vs male) | Standardize formats |
| Structure | Multiple values in one column, unclear column names | Normalize to rows = observations, columns = variables |
| Complex values | Combined info (“180cm / 75kg” in one cell), nested/encoded data | Parse or separate into components |
| Unit conversion | Mixed units (kg vs pounds, USD vs SGD) | Convert to consistent unit system |
| Magnitude interpretation | Outliers (age = 999), different scales (0–1 vs 0–100) | Clip, normalize, or investigate outliers |
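
A few of the fixes in the table can be sketched in pandas as follows. The data, the column names, and the plausible age range of 0 to 120 are assumptions for illustration; which fix is appropriate depends on the dataset.

```python
import pandas as pd

# Hypothetical messy records
raw = pd.DataFrame({
    "gender": ["Male", "male ", "FEMALE", "female"],
    "age": [25.0, None, 31.0, 999.0],
    "body": ["180cm / 75kg", "165cm / 60kg", "172cm / 68kg", "158cm / 52kg"],
})

clean = raw.copy()

# Formatting: trim whitespace and standardize case
clean["gender"] = clean["gender"].str.strip().str.lower()

# Magnitude: treat ages outside a plausible range as missing
clean["age"] = clean["age"].where(clean["age"].between(0, 120))

# Missing values: flag the affected rows, then fill with the median
clean["age_imputed"] = clean["age"].isna()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Complex values: split the combined cell into two numeric columns
parts = clean["body"].str.extract(
    r"(?P<height_cm>\d+)cm\s*/\s*(?P<weight_kg>\d+)kg"
).astype(int)
clean = clean.drop(columns="body").join(parts)
```

Keeping the `age_imputed` flag makes it possible to check later whether the filled values affect the analysis.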

📈 Data Visualization

Two useful visualization tools are Python’s Matplotlib and Seaborn libraries:

  • Matplotlib: create two-dimensional plots of your data
  • Seaborn: based on Matplotlib; supports multidimensional plots and more advanced visualizations

Visualizing Qualitative Data (Nominal/Ordinal)

  • Visualize counts per category using Seaborn's countplot()
  • Visualize per-category averages of a numerical variable using Seaborn's barplot()
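
As a small sketch of both functions, the example below plots a hypothetical `ratings` table with Seaborn. The data and column names are made up, and the `Agg` backend is selected only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: an ordinal category and a numerical rating
ratings = pd.DataFrame({
    "size": ["S", "M", "M", "L", "L", "L"],
    "stars": [3, 4, 5, 2, 4, 3],
})
size_order = ["S", "M", "L"]

fig, (ax_count, ax_mean) = plt.subplots(1, 2, figsize=(8, 3))

# countplot: number of rows in each category
sns.countplot(data=ratings, x="size", order=size_order, ax=ax_count)
ax_count.set_title("Count per size")

# barplot: mean of a numerical column per category (mean is the default)
sns.barplot(data=ratings, x="size", y="stars", order=size_order, ax=ax_mean)
ax_mean.set_title("Mean stars per size")

fig.tight_layout()
```

Passing `order` explicitly keeps ordinal categories in their natural sequence instead of the order in which they appear in the data.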

Visualizing Quantitative Data

  • Histogram → shows distribution of one numerical variable
  • Scatter plot → shows relationship between two numerical variables
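
Both plot types can be sketched with Matplotlib alone. The heights and weights below are synthetic values generated for illustration, not real measurements, and the `Agg` backend is used only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: weight loosely depends on height, plus noise
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 8, size=200)
weight_kg = 0.9 * (height_cm - 100) + rng.normal(0, 5, size=200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: distribution of one numerical variable
counts, bin_edges, _ = ax_hist.hist(height_cm, bins=15)
ax_hist.set_xlabel("Height (cm)")
ax_hist.set_ylabel("Count")

# Scatter plot: relationship between two numerical variables
ax_scatter.scatter(height_cm, weight_kg, s=10)
ax_scatter.set_xlabel("Height (cm)")
ax_scatter.set_ylabel("Weight (kg)")

fig.tight_layout()
```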

🎯 Prediction in Data Science

Two Main Types of Prediction Tasks

A. Classification (Predicts categories)

  • Used to predict a categorical variable (a category or label)
  • Example: Predicting whether an image is a dog or a cat
  • Common method: K-Nearest Neighbors (KNN), which finds the k labeled data points closest to a new point and assigns it the most common label among them
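
To make the KNN idea concrete, here is a from-scratch sketch in NumPy using Euclidean distance and a majority vote. The function `knn_predict`, the two features, and the training points are all hypothetical; in practice a library implementation such as scikit-learn's `KNeighborsClassifier` would usually be preferable.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(str(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: [weight_kg, ear_length_cm]
X_train = np.array([
    [4.0, 6.5], [3.5, 7.0], [5.0, 6.0],        # cats
    [20.0, 10.0], [25.0, 12.0], [30.0, 11.0],  # dogs
])
y_train = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

heavy_animal = knn_predict(X_train, y_train, np.array([22.0, 11.0]), k=3)
light_animal = knn_predict(X_train, y_train, np.array([4.2, 6.4]), k=3)
```

Because the method relies on distances, features on very different scales would normally be normalized first; that step is omitted here to keep the sketch short.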

B. Regression (Predicts numbers)

  • Used to predict a continuous variable (a number)
  • Example: Predicting the price of a car
  • Common method: Linear Regression, which models a linear relationship between variables and uses that relationship to make predictions
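
A minimal sketch of the car price example, assuming a single feature (car age) and made-up prices, fits a straight line by least squares with `np.polyfit` and uses it to predict an unseen value. The helper `predict_price` is hypothetical.

```python
import numpy as np

# Hypothetical data: car age in years vs price in thousands
age_years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price_k = np.array([27.5, 26.4, 23.8, 22.1, 20.2])

# Fit price = slope * age + intercept by ordinary least squares
slope, intercept = np.polyfit(age_years, price_k, deg=1)


def predict_price(age):
    """Predict price (in thousands) from age using the fitted line."""
    return slope * age + intercept


# The output is a continuous number, unlike a classifier's label
predicted = predict_price(6.0)
```

The fitted slope is negative, which matches the intuition that older cars in this toy dataset cost less.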

Classification outputs labels; Regression outputs numbers. Choosing the right type is the first decision in any prediction task.


Part of my data science foundations series. Next: statistical testing, correlation, and feature engineering.

This post is licensed under CC BY 4.0 by the author.