Ojaswi Athghara | NumPy for Beginners: Complete Data Analysis Fundamentals Guide

My First Encounter with NumPy

"Just use NumPy," they said. "It's easy," they said. I stared at my screen, completely lost. Arrays? Broadcasting? Vectorization? The documentation assumed I already knew what I was doing.

Sound familiar? I've been there. NumPy seemed like this magical tool everyone used, but nobody explained it in plain English. After months of frustration and breakthroughs, I finally "got it."

This guide is what I wish I had when starting. No jargon. No assumptions. Just clear explanations and practical examples that will take you from "What's NumPy?" to confidently analyzing data.

What is NumPy and Why Should You Care?

NumPy (Numerical Python) is the foundation of data science in Python. Think of it as a supercharged calculator that can handle millions of numbers at once.

Why NumPy Over Python Lists?

import numpy as np
import time

# Create a million numbers
numbers_list = list(range(1000000))
numbers_array = np.array(numbers_list)

# Time a simple operation: multiply everything by 2
# perf_counter() is the clock meant for measuring short durations
start = time.perf_counter()
doubled_list = [x * 2 for x in numbers_list]
list_time = time.perf_counter() - start

start = time.perf_counter()
doubled_array = numbers_array * 2
numpy_time = time.perf_counter() - start

print(f"Python list: {list_time:.4f} seconds")
print(f"NumPy array: {numpy_time:.4f} seconds")
print(f"NumPy is {list_time/numpy_time:.0f}x faster!")

Output:

Python list: 0.0523 seconds
NumPy array: 0.0013 seconds
NumPy is 40x faster!

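That's not a typo. The exact numbers depend on your machine and NumPy version, but for element-wise numerical work like this, NumPy is often 10-100x faster than a pure Python loop because the loop runs in compiled code instead of the Python interpreter.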
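
Single-shot timings like the one above are noisy. Here is a minimal sketch of a steadier comparison, assuming the standard library's `timeit` module; the choice of 3 repeats of 5 runs each is arbitrary, and no output is shown because the numbers depend on your machine:

```python
import timeit

import numpy as np

numbers_list = list(range(1_000_000))
numbers_array = np.array(numbers_list)

RUNS = 5  # arbitrary; more runs give steadier numbers but take longer

# Keep the best of several repeats to reduce noise from other processes
list_time = min(timeit.repeat(lambda: [x * 2 for x in numbers_list],
                              number=RUNS, repeat=3))
numpy_time = min(timeit.repeat(lambda: numbers_array * 2,
                               number=RUNS, repeat=3))

print(f"Python list: {list_time / RUNS:.4f} seconds per run")
print(f"NumPy array: {numpy_time / RUNS:.4f} seconds per run")
print(f"Speedup: {list_time / numpy_time:.0f}x")
```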

Installing NumPy

# Run ONE of these in a terminal (not inside Python)

# Using pip
pip install numpy

# Using conda
conda install numpy

# Import NumPy (standard convention)
import numpy as np

# Check version
print(np.__version__)

Your First NumPy Array

Think of arrays as containers that hold numbers in organized rows and columns.

Creating Arrays

import numpy as np

# From a Python list
my_list = [1, 2, 3, 4, 5]
my_array = np.array(my_list)

print(f"Python list: {my_list}")
print(f"NumPy array: {my_array}")
print(f"Type: {type(my_array)}")

# Shorthand
arr = np.array([1, 2, 3, 4, 5])
print(f"\nArray: {arr}")

Understanding Array Shapes

# 1D array (like a single row)
arr_1d = np.array([1, 2, 3, 4])
print(f"1D array: {arr_1d}")
print(f"Shape: {arr_1d.shape}")  # (4,) means 4 elements

# 2D array (like a table)
arr_2d = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(f"\n2D array:\n{arr_2d}")
print(f"Shape: {arr_2d.shape}")  # (2, 3) means 2 rows, 3 columns

# 3D array (like stacked tables)
arr_3d = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]]])
print(f"\n3D array:\n{arr_3d}")
print(f"Shape: {arr_3d.shape}")  # (2, 2, 2)
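
Shape is not the only thing an array can tell you about itself. A small sketch of the other attributes I check most often (`ndim`, `size`, and `dtype`), using the same 2D array; the `floats` array is just an extra example I made up to show a different data type:

```python
import numpy as np

arr_2d = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(f"Shape: {arr_2d.shape}")          # (2, 3)
print(f"Dimensions: {arr_2d.ndim}")      # 2
print(f"Total elements: {arr_2d.size}")  # 6
print(f"Data type: {arr_2d.dtype}")      # an integer type; the width depends on your platform

# One float in the input is enough to make the whole array float
floats = np.array([1.5, 2, 3])
print(f"Float dtype: {floats.dtype}")    # float64
```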

Quick Array Creation Functions

# Array of zeros
zeros = np.zeros(5)
print(f"Zeros: {zeros}")

# 2D zeros
zeros_2d = np.zeros((3, 4))  # 3 rows, 4 columns
print(f"\nZeros 2D:\n{zeros_2d}")

# Array of ones
ones = np.ones(5)
print(f"\nOnes: {ones}")

# Array with range of numbers
range_arr = np.arange(10)  # 0 to 9
print(f"\nRange 0-9: {range_arr}")

range_arr = np.arange(5, 15)  # 5 to 14
print(f"Range 5-14: {range_arr}")

range_arr = np.arange(0, 10, 2)  # 0 to 9, step by 2
print(f"Even numbers: {range_arr}")

# Evenly spaced numbers
spaced = np.linspace(0, 10, 5)  # 5 evenly spaced numbers from 0 to 10, both ends included
print(f"\nLinspace: {spaced}")

# Random numbers
# (named random_floats so it can't be confused with Python's own "random" module)
random_floats = np.random.rand(5)  # 5 random numbers between 0 and 1
print(f"\nRandom: {random_floats}")

random_int = np.random.randint(1, 100, size=10)  # 10 random integers
print(f"Random integers: {random_int}")

Accessing Array Elements

Just like lists, but more powerful.

Indexing

arr = np.array([10, 20, 30, 40, 50])

# Access single element
print(f"First element: {arr[0]}")
print(f"Last element: {arr[-1]}")
print(f"Third element: {arr[2]}")

# 2D array indexing
arr_2d = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(f"\nFull array:\n{arr_2d}")
print(f"Element at row 0, col 1: {arr_2d[0, 1]}")  # 2
print(f"Element at row 2, col 2: {arr_2d[2, 2]}")  # 9

# Get entire rows or columns
print(f"\nFirst row: {arr_2d[0]}")
print(f"Second column: {arr_2d[:, 1]}")  # : means "all rows"

Slicing (Getting Multiple Elements)

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Get range [start:end]
print(f"Elements 2 to 5: {arr[2:6]}")  # Note: end index not included
print(f"First 5 elements: {arr[:5]}")
print(f"Last 3 elements: {arr[-3:]}")
print(f"Every other element: {arr[::2]}")

# 2D slicing
arr_2d = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

print("\nFirst 2 rows, last 2 columns:")
print(arr_2d[:2, -2:])

Basic Array Operations

The fun part: actually doing math!

Arithmetic Operations

arr = np.array([1, 2, 3, 4, 5])

# Add/subtract/multiply/divide with numbers
print(f"Original: {arr}")
print(f"Add 10: {arr + 10}")
print(f"Multiply by 2: {arr * 2}")
print(f"Divide by 2: {arr / 2}")
print(f"Power of 2: {arr ** 2}")

# Operations between arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([10, 20, 30])

print(f"\narr1: {arr1}")
print(f"arr2: {arr2}")
print(f"arr1 + arr2: {arr1 + arr2}")
print(f"arr1 * arr2: {arr1 * arr2}")
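
The arrays above have the same shape, but they don't have to. This is the "broadcasting" that confused me at the start: when shapes are compatible, NumPy stretches the smaller array to fit the larger one. A minimal sketch, with `table` and `row` as made-up example names:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])    # shape (3,)

# The 1D row is added to every row of the 2D table
result = table + row
print(result)
# [[11 22 33]
#  [14 25 36]]
```

Adding a single number (like `arr + 10` earlier) is the simplest case of the same rule.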

Comparison Operations

arr = np.array([1, 2, 3, 4, 5])

# Create boolean arrays
print(f"Array: {arr}")
print(f"Greater than 3: {arr > 3}")
print(f"Equal to 3: {arr == 3}")
print(f"Less than or equal to 2: {arr <= 2}")

# Use comparisons to filter
filtered = arr[arr > 3]
print(f"\nElements > 3: {filtered}")

even = arr[arr % 2 == 0]
print(f"Even numbers: {even}")
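
Filters can also be combined. A short sketch, assuming the same `arr` as above: use `&`, `|`, and `~` instead of Python's `and`, `or`, and `not`, and wrap each condition in parentheses.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# AND: both conditions must hold
middle = arr[(arr > 1) & (arr < 5)]
print(f"Between 1 and 5: {middle}")  # [2 3 4]

# OR: either condition may hold
ends = arr[(arr == 1) | (arr == 5)]
print(f"1 or 5: {ends}")             # [1 5]

# NOT: flip the boolean array
odd = arr[~(arr % 2 == 0)]
print(f"Not even: {odd}")            # [1 3 5]
```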

Essential Functions for Data Analysis

The bread and butter of data analysis.

Statistical Functions

data = np.array([23, 45, 67, 12, 89, 34, 56, 78, 90, 21])

print("Data:", data)
print(f"\nMean (average): {np.mean(data)}")
print(f"Median (middle value): {np.median(data)}")
print(f"Standard deviation: {np.std(data):.2f}")
print(f"Variance: {np.var(data):.2f}")

print(f"\nMinimum: {np.min(data)}")
print(f"Maximum: {np.max(data)}")
print(f"Range: {np.max(data) - np.min(data)}")

print(f"\nSum: {np.sum(data)}")
# dtype=np.int64 avoids overflow on platforms where the default integer is 32-bit
print(f"Product: {np.prod(data, dtype=np.int64)}")
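
One detail that tripped me up: `np.std` and `np.var` divide by the number of values by default (the "population" formula). If your data is a sample and you want the version most statistics courses teach, pass `ddof=1`. A small sketch on the same data:

```python
import numpy as np

data = np.array([23, 45, 67, 12, 89, 34, 56, 78, 90, 21])

# Default (ddof=0): divide by n
population_var = np.var(data)
population_std = np.std(data)

# ddof=1: divide by n - 1 (sample statistics)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(f"Population variance: {population_var:.2f}")  # 750.25
print(f"Sample variance: {sample_var:.2f}")
print(f"Population std: {population_std:.2f}")
print(f"Sample std: {sample_std:.2f}")
```

The sample versions are always a little larger, and the gap shrinks as the dataset grows.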

Working with 2D Data (Like Spreadsheets)

# Student grades: rows=students, columns=subjects
grades = np.array([[85, 92, 78],   # Student 1
                   [90, 88, 95],   # Student 2
                   [76, 82, 80],   # Student 3
                   [92, 95, 89]])  # Student 4

print("Grades table:")
print(grades)

# Statistics for each student (across columns)
print("\nEach student's average:")
print(np.mean(grades, axis=1))

# Statistics for each subject (across rows)
print("\nEach subject's average:")
print(np.mean(grades, axis=0))

# Overall statistics
print(f"\nOverall average: {np.mean(grades):.2f}")
print(f"Highest grade: {np.max(grades)}")
print(f"Lowest grade: {np.min(grades)}")
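
If `axis` feels backwards, it helps to think of it as "the dimension that disappears". A quick sketch with the same grades table, checking the shape of each result:

```python
import numpy as np

grades = np.array([[85, 92, 78],
                   [90, 88, 95],
                   [76, 82, 80],
                   [92, 95, 89]])  # shape (4, 3): 4 students, 3 subjects

# axis=0 collapses the rows, leaving one value per column (subject)
best_per_subject = np.max(grades, axis=0)
print(best_per_subject)        # [92 95 95]
print(best_per_subject.shape)  # (3,)

# axis=1 collapses the columns, leaving one value per row (student)
best_per_student = np.max(grades, axis=1)
print(best_per_student)        # [92 95 82 95]
print(best_per_student.shape)  # (4,)
```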

Reshaping Arrays

Change how your data is organized.

# Start with 1D array
arr = np.arange(12)
print(f"Original: {arr}")

# Reshape to 2D
arr_2d = arr.reshape(3, 4)  # 3 rows, 4 columns
print(f"\nReshaped to 3x4:\n{arr_2d}")

# Reshape to different dimensions
arr_2d = arr.reshape(4, 3)  # 4 rows, 3 columns
print(f"\nReshaped to 4x3:\n{arr_2d}")

# Flatten back to 1D
flat = arr_2d.flatten()
print(f"\nFlattened: {flat}")

# Transpose (flip rows and columns)
transposed = arr_2d.T
print(f"\nTransposed:\n{transposed}")
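
Two things worth knowing about `reshape`: you can pass `-1` for one dimension and let NumPy work it out, and the total number of elements has to stay the same. A minimal sketch:

```python
import numpy as np

arr = np.arange(12)

# -1 means "figure this dimension out from the array's size"
auto_cols = arr.reshape(3, -1)
print(auto_cols.shape)  # (3, 4)

auto_rows = arr.reshape(-1, 6)
print(auto_rows.shape)  # (2, 6)

# 12 elements cannot fill a 5x3 grid, so NumPy raises ValueError
try:
    arr.reshape(5, 3)
except ValueError as err:
    print(f"Cannot reshape: {err}")
```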

Practical Example: Analyzing Sales Data

Let's put it all together with a real-world example.

# Sales data: [Product A, Product B, Product C, Product D]
# Each row is a different month
sales = np.array([
    [120, 145, 98, 167],   # January
    [135, 152, 103, 178],  # February
    [142, 148, 110, 185],  # March
    [155, 160, 115, 190],  # April
    [168, 172, 122, 195]   # May
])

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
products = ['Product A', 'Product B', 'Product C', 'Product D']

print("Sales Data (units sold):")
print(sales)

# Total sales per month
monthly_totals = np.sum(sales, axis=1)
print("\nTotal sales per month:")
for month, total in zip(months, monthly_totals):
    print(f"  {month}: {total} units")

# Total sales per product
product_totals = np.sum(sales, axis=0)
print("\nTotal sales per product:")
for product, total in zip(products, product_totals):
    print(f"  {product}: {total} units")

# Best and worst performing products
best_product_idx = np.argmax(product_totals)
worst_product_idx = np.argmin(product_totals)

print(f"\nBest performer: {products[best_product_idx]} ({product_totals[best_product_idx]} units)")
print(f"Worst performer: {products[worst_product_idx]} ({product_totals[worst_product_idx]} units)")

# Average sales per product
print("\nAverage monthly sales per product:")
for product, avg in zip(products, np.mean(sales, axis=0)):
    print(f"  {product}: {avg:.1f} units/month")

# Growth: compare last month to first month
growth = ((sales[-1] - sales[0]) / sales[0]) * 100
print("\nGrowth from January to May:")
for product, g in zip(products, growth):
    print(f"  {product}: {g:.1f}%")
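
Comparing only January to May hides what happened in between. As a possible extension (not part of the analysis above), `np.diff` gives the month-to-month change; the names `monthly_change` and `pct_change` are my own:

```python
import numpy as np

sales = np.array([
    [120, 145, 98, 167],   # January
    [135, 152, 103, 178],  # February
    [142, 148, 110, 185],  # March
    [155, 160, 115, 190],  # April
    [168, 172, 122, 195]   # May
])
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']

# np.diff along axis=0 subtracts each month from the one after it
monthly_change = np.diff(sales, axis=0)         # shape (4, 4)
pct_change = monthly_change / sales[:-1] * 100  # relative to the earlier month

print("Month-over-month growth (Products A-D):")
for i, row in enumerate(pct_change, start=1):
    formatted = ", ".join(f"{p:+.1f}%" for p in row)
    print(f"  {months[i - 1]} -> {months[i]}: {formatted}")
```

This view shows, for example, that Product B dipped from February to March even though it grew overall.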

Sorting and Finding Elements

data = np.array([45, 23, 67, 12, 89, 34, 56])

# Sort array
sorted_data = np.sort(data)
print(f"Original: {data}")
print(f"Sorted: {sorted_data}")

# Find indices of sorted elements
sort_indices = np.argsort(data)
print(f"Sort indices: {sort_indices}")
print(f"Using indices: {data[sort_indices]}")

# Find where condition is true
above_50 = np.where(data > 50)
print(f"\nIndices where > 50: {above_50[0]}")
print(f"Values > 50: {data[above_50]}")

# Unique values
data_with_dupes = np.array([1, 2, 2, 3, 3, 3, 4])
unique = np.unique(data_with_dupes)
print(f"\nWith duplicates: {data_with_dupes}")
print(f"Unique values: {unique}")
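
A few variations on the same functions that come up constantly: sorting in descending order, grabbing the top few values, and counting how often each unique value appears. A short sketch using the same data:

```python
import numpy as np

data = np.array([45, 23, 67, 12, 89, 34, 56])

# Descending order: sort ascending, then reverse
descending = np.sort(data)[::-1]
print(f"Descending: {descending}")    # [89 67 56 45 34 23 12]

# Indices of the 3 largest values, largest first
top3_idx = np.argsort(data)[-3:][::-1]
print(f"Top 3 indices: {top3_idx}")   # [4 2 6]
print(f"Top 3 values: {data[top3_idx]}")  # [89 67 56]

# Unique values together with how many times each occurs
data_with_dupes = np.array([1, 2, 2, 3, 3, 3, 4])
values, counts = np.unique(data_with_dupes, return_counts=True)
print(f"Values: {values}")  # [1 2 3 4]
print(f"Counts: {counts}")  # [1 2 3 1]
```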

Combining Arrays

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Stack vertically (rows)
v_stacked = np.vstack([arr1, arr2])
print("Vertical stack:")
print(v_stacked)

# Stack horizontally (columns)
h_stacked = np.hstack([arr1, arr2])
print(f"\nHorizontal stack: {h_stacked}")

# Concatenate
concatenated = np.concatenate([arr1, arr2])
print(f"Concatenated: {concatenated}")
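
With 1D arrays, `hstack` and `concatenate` give the same result, so the difference only shows up in 2D. A small sketch, with `a` and `b` as made-up 2x2 arrays, plus `np.column_stack` for turning 1D arrays into columns:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# axis=0: add b's rows underneath a's rows
rows_joined = np.concatenate([a, b], axis=0)
print(rows_joined.shape)  # (4, 2)

# axis=1: add b's columns to the right of a's columns
cols_joined = np.concatenate([a, b], axis=1)
print(cols_joined)
# [[1 2 5 6]
#  [3 4 7 8]]

# column_stack turns each 1D array into one column of a table
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
as_columns = np.column_stack([arr1, arr2])
print(as_columns)
# [[1 4]
#  [2 5]
#  [3 6]]
```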

Practical Example: Grade Calculator

# Student exam scores
exam1 = np.array([85, 90, 78, 92, 88])
exam2 = np.array([88, 85, 82, 95, 90])
exam3 = np.array([90, 92, 80, 93, 91])
homework = np.array([95, 88, 85, 90, 92])

# Combine all scores
all_scores = np.vstack([exam1, exam2, exam3, homework])
print("All scores:")
print(all_scores)

# Calculate final grades (weighted average)
# Exams: 25% each, Homework: 25%
weights = np.array([0.25, 0.25, 0.25, 0.25])

# Weighted average for each student
final_grades = np.average(all_scores, axis=0, weights=weights)

print("\nFinal grades:")
for i, grade in enumerate(final_grades, 1):
    print(f"  Student {i}: {grade:.2f}")

# Grade distribution
print(f"\nClass average: {np.mean(final_grades):.2f}")
print(f"Highest grade: {np.max(final_grades):.2f}")
print(f"Lowest grade: {np.min(final_grades):.2f}")

# Letter grades
def get_letter_grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

print("\nLetter grades:")
for i, grade in enumerate(final_grades, 1):
    letter = get_letter_grade(grade)
    print(f"  Student {i}: {grade:.2f} ({letter})")
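
The loop above works fine for five students. For a large class, one alternative (my suggestion, not the only way) is to assign all the letters at once with `np.digitize`, which tells you which grade band each score falls into. The `cutoffs` and `labels` names are mine:

```python
import numpy as np

all_scores = np.array([[85, 90, 78, 92, 88],   # exam 1
                       [88, 85, 82, 95, 90],   # exam 2
                       [90, 92, 80, 93, 91],   # exam 3
                       [95, 88, 85, 90, 92]])  # homework
final_grades = all_scores.mean(axis=0)  # equal weights, same result as the weighted average

cutoffs = np.array([60, 70, 80, 90])          # lowest score for D, C, B, A
labels = np.array(['F', 'D', 'C', 'B', 'A'])

# For each grade, np.digitize counts how many cutoffs it has reached (0 to 4)
letters = labels[np.digitize(final_grades, cutoffs)]
print(letters)  # ['B' 'B' 'B' 'A' 'A']
```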

Random Numbers for Data Science

NumPy's random module is essential for simulations and testing.

# Set seed for reproducibility
np.random.seed(42)

# Random floats between 0 and 1
random_floats = np.random.rand(5)
print(f"Random floats: {random_floats}")

# Random integers
random_ints = np.random.randint(1, 100, size=10)
print(f"Random integers (1-99): {random_ints}")

# Normal distribution (bell curve)
normal_data = np.random.randn(1000)
print("\nNormal distribution:")
print(f"  Mean: {np.mean(normal_data):.4f}")
print(f"  Std: {np.std(normal_data):.4f}")

# Random choice from array
fruits = np.array(['apple', 'banana', 'orange', 'grape'])
random_fruit = np.random.choice(fruits)
print(f"\nRandom fruit: {random_fruit}")

# Shuffle array
arr = np.arange(10)
np.random.shuffle(arr)
print(f"Shuffled: {arr}")
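
The functions above still work, but NumPy's documentation now recommends creating a generator with `np.random.default_rng` for new code. A sketch of the same steps in that style; note that a seeded generator produces different numbers than `np.random.seed` with the same seed:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

random_floats = rng.random(5)                # floats in [0, 1)
random_ints = rng.integers(1, 100, size=10)  # integers from 1 to 99
normal_data = rng.standard_normal(1000)      # mean near 0, std near 1

fruits = np.array(['apple', 'banana', 'orange', 'grape'])
random_fruit = rng.choice(fruits)

arr = np.arange(10)
rng.shuffle(arr)  # shuffles in place, like np.random.shuffle

print(f"Random floats: {random_floats}")
print(f"Random integers (1-99): {random_ints}")
print(f"Random fruit: {random_fruit}")
print(f"Shuffled: {arr}")
```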

Common Mistakes to Avoid

Mistake 1: Forgetting Array Shape

# Shape matters!
arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3]])

print(f"1D shape: {arr_1d.shape}")  # (3,)
print(f"2D shape: {arr_2d.shape}")  # (1, 3)

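# They hold the same numbers, but the 1D array has one axis and the 2D array has two,
# so indexing, transposing, and broadcasting treat them differently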
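
To make that concrete, here is a short sketch of two places where the difference shows up, plus how to convert a 1D array into a column when you need one:

```python
import numpy as np

arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3]])

# Indexing: one index gives a number from 1D, but a whole row from 2D
print(arr_1d[0])  # a single number
print(arr_2d[0])  # [1 2 3]

# Transposing: nothing happens to a 1D array
print(arr_1d.T.shape)  # (3,)
print(arr_2d.T.shape)  # (3, 1)

# To get a column from a 1D array, reshape it explicitly
column = arr_1d.reshape(-1, 1)
print(column.shape)    # (3, 1)
```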

Mistake 2: Modifying Views Instead of Copies

original = np.array([1, 2, 3, 4, 5])
view = original[1:4]  # This is a VIEW, not a copy
view[0] = 999

print(f"Original: {original}")  # Changed!

# To avoid this, make a copy
original = np.array([1, 2, 3, 4, 5])
actual_copy = original[1:4].copy()
actual_copy[0] = 999
print(f"Original (with copy): {original}")  # Unchanged
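
If you're not sure whether you have a view or a copy, you can ask NumPy. A small sketch using `np.shares_memory`; the variable names are mine:

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

sliced = original[1:4]               # basic slicing gives a view
copied = original[1:4].copy()        # .copy() gives independent data
filtered = original[original > 2]    # boolean filtering gives a copy

print(np.shares_memory(original, sliced))    # True
print(np.shares_memory(original, copied))    # False
print(np.shares_memory(original, filtered))  # False
```

So the surprise only happens with slices; filtered results are safe to modify.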

Mistake 3: Comparing Arrays Wrong

arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])

# WRONG: arr1 == arr2 returns an array of booleans, not a single True/False
# if arr1 == arr2:  # Raises ValueError: the truth value of an array is ambiguous

# RIGHT: Use np.array_equal()
if np.array_equal(arr1, arr2):
    print("Arrays are equal!")
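
One more wrinkle: with decimal numbers, exact equality can fail because of floating-point rounding. A sketch of the usual workaround, `np.allclose`, which compares within a small tolerance; `a` and `b` are made-up examples:

```python
import numpy as np

a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

# Exact comparison fails: 0.1 + 0.2 is not stored as exactly 0.3
exact = np.array_equal(a, b)
print(f"Exactly equal: {exact}")  # False

# allclose treats tiny rounding differences as equal
close = np.allclose(a, b)
print(f"Close enough: {close}")   # True

# For element-wise checks, reduce the boolean array with .all() or .any()
all_match = (a == b).all()
any_match = (a == b).any()
print(f"All match: {all_match}, any match: {any_match}")  # False, True
```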

Quick Reference Cheat Sheet

# Creation
np.array([1, 2, 3])          # From list
np.zeros(5)                   # [0., 0., 0., 0., 0.] (floats by default)
np.ones(5)                    # [1., 1., 1., 1., 1.] (floats by default)
np.arange(10)                 # [0, 1, 2, ..., 9]
np.linspace(0, 10, 5)         # 5 numbers from 0 to 10
np.random.rand(3, 3)          # 3×3 random array

# Info
arr.shape                     # Dimensions
arr.dtype                     # Data type
arr.size                      # Total elements
arr.ndim                      # Number of dimensions

# Operations
arr + 10                      # Add to all
arr * 2                       # Multiply all
arr > 5                       # Boolean array

# Statistics
np.mean(arr)                  # Average
np.median(arr)                # Middle value
np.std(arr)                   # Standard deviation
np.min(arr)                   # Minimum
np.max(arr)                   # Maximum
np.sum(arr)                   # Total sum

# Indexing
arr[0]                        # First element
arr[-1]                       # Last element
arr[2:5]                      # Slice
arr[arr > 5]                  # Filter

# Reshaping
arr.reshape(2, 3)             # New shape (arr must have exactly 6 elements)
arr.flatten()                 # To 1D
arr.T                         # Transpose

Your Next Steps

Congratulations! You now understand NumPy fundamentals. Here's what to learn next:

Practice - Work with real datasets (CSV files, Excel)
Pandas - Built on NumPy, makes data analysis even easier
Matplotlib - Visualize your NumPy arrays as charts
Machine Learning - Use NumPy with scikit-learn

Remember: NumPy sits underneath most of the Python data science stack, including Pandas and scikit-learn. You've just learned the foundation the rest of the ecosystem builds on.

Practice Exercises

Try these on your own:

Create an array of 100 random numbers and find mean, median, std
Simulate dice rolls: roll two dice 1000 times, calculate average
Create a 5×5 multiplication table using NumPy
Analyze temperature data: create 30 random temperatures, find average, max, min
Calculate grades: given test scores, compute weighted averages