Module 9: Rows and Columns Explained
1. What are rows and columns?
Row: a horizontal sequence of items. In a table, one row usually represents one record, observation, or vector.
Column: a vertical sequence of items. A column usually represents one feature, variable, or attribute of each record.
Key conceptual differences:
Rows ≈ instances/records (e.g., one user, one sample).
Columns ≈ features/attributes (e.g., age, name, temperature).
Contexts where rows/columns appear:
Matrix / Linear algebra — rows and columns are numeric arrays; linear operations often treat rows as vectors or columns as vectors.
Spreadsheet (Excel/Google Sheets) — grid of cells; rows often indexed by numbers, columns by letters.
Relational databases (SQL) — tables; rows are tuples, columns are attributes.
DataFrames (Pandas, R) — hybrid: labelled rows and columns with built-in operations.
HTML — <tr> is a table row, <td>/<th> are cells (columns).
2. Fundamental operations & methods (concepts)
2.1 In matrices (math)
Row vector: 1 × n matrix.
Column vector: n × 1 matrix.
Transpose Aᵀ: flips rows and columns.
Row operations: swap rows, multiply row by scalar, add scalar multiple of one row to another (used in Gaussian elimination).
Column operations: analogous to row operations but on columns.
Row space / Column space: linear span of row vectors / column vectors.
Rank: dimension of row space = dimension of column space.
Why it matters: linear independence, solving systems Ax = b (use rows/columns view to understand elimination vs decomposition).
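The rank fact above (row-space dimension = column-space dimension) can be checked directly in NumPy. A small sketch, using a matrix whose third row is the sum of the first two:

```python
import numpy as np

# 3x3 matrix with a dependent third row (row 3 = row 1 + row 2),
# so the rows -- and the columns -- span only a 2-dimensional space.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

print(np.linalg.matrix_rank(A))    # rank of the row space -> 2
print(np.linalg.matrix_rank(A.T))  # rank of the column space -> same value, 2
```

`matrix_rank` computes the rank numerically (via SVD), so it gives the same answer whether you think in terms of rows or columns.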
2.2 In dataframes / spreadsheets
Select a column — get a vector of values (feature extraction).
Select a row — get a record (instance view).
Filtering (row-wise) — select rows matching a condition.
Aggregation (column-wise) — compute mean, sum, count over a column.
Add/remove columns — create derived features (e.g., bmi = weight / height^2).
Pivot / melt (reshape) — convert rows ↔ columns (long vs wide formats).
Group-by (rows grouped by column values) — compute aggregated statistics per group.
Why it matters: data cleaning, feature engineering, analysis.
2.3 In SQL
SELECT columns FROM table — column projection.
WHERE (row filter) — choose rows satisfying predicates.
GROUP BY column(s) — treat rows grouped by values in specified columns for aggregation.
JOINs — combine rows from multiple tables based on column equality (foreign keys).

Indexes — speed row lookup by values of a column.
Why it matters: efficient querying; data integrity and normalization.
2.4 In HTML/CSS
<table>, <tr>, <td>, <th> define rows and cells (columns visually).
Use CSS to style columns (e.g., nth-child selectors) or use <colgroup> for column-level attributes.
3. Concrete code examples
All examples use small, runnable snippets. Read each snippet top-to-bottom.
3.1 Python — numeric matrices with numpy
```python
import numpy as np

# Create a 3x4 matrix (float dtype so scalar row operations keep precision)
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]], dtype=float)  # A.shape -> (3, 4)

# Select second row (index 1)
row2 = A[1, :]   # -> array([5., 6., 7., 8.])

# Select third column (index 2)
col3 = A[:, 2]   # -> array([ 3.,  7., 11.])

# Transpose
AT = A.T         # shape (4, 3)

# Row operation: swap row 0 and row 2 (in-place)
A[[0, 2], :] = A[[2, 0], :]

# Row operation: multiply row 0 by a scalar
# (NumPy has no built-in row-echelon routine; this is one elimination step)
A[0, :] = 0.5 * A[0, :]
```
Explanation: A[i, :] picks row i; A[:, j] picks column j. Transpose swaps rows and columns; many linear algebra algorithms pivot on row operations.
3.2 Python — tabular with pandas
```python
import pandas as pd

df = pd.DataFrame({
    'id':    [101, 102, 103, 104],
    'age':   [34, 23, 45, 31],
    'score': [88, 92, 75, 85],
})

# Select column(s)
ages = df['age']             # Series (column)
sub = df[['id', 'score']]    # DataFrame with 2 columns

# Select row by position
row0 = df.iloc[0]            # first row

# Filter rows
high = df[df['score'] >= 85]  # rows with score >= 85

# Add a derived column
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50], labels=['young', 'mid'])

# Group by a column (rows grouped by age_group) and aggregate over a column
agg = df.groupby('age_group')['score'].mean().reset_index()
```
DataFrames treat columns as named Series (vectors). Row selection can be by position (iloc) or label (loc) if you set an index.
3.3 SQL — table operations (pseudo-SQL)
```sql
-- Table: students(id INTEGER PRIMARY KEY, name TEXT, age INTEGER, score INTEGER)

-- Project columns
SELECT name, score FROM students;

-- Filter rows
SELECT * FROM students WHERE age BETWEEN 18 AND 25;

-- Group (rows grouped by age) + aggregate (column-wise)
SELECT age, AVG(score) AS avg_score
FROM students
GROUP BY age
ORDER BY age;

-- Join (rows from students and classes matched via enrollments)
SELECT s.name, c.class_name
FROM students s
JOIN enrollments e ON s.id = e.student_id
JOIN classes c ON e.class_id = c.id;
```
SELECT chooses columns; WHERE filters rows; GROUP BY compresses many rows into aggregated rows using column values.
3.4 HTML simple table
```html
<table>
  <thead>
    <tr><th>ID</th><th>Name</th><th>Score</th></tr>
  </thead>
  <tbody>
    <tr><td>101</td><td>Alice</td><td>88</td></tr>
    <tr><td>102</td><td>Bob</td><td>92</td></tr>
  </tbody>
</table>
```
Each <tr> is a row; table columns implied by <td> order.
4. Exercises
Exercise 1 — Matrix row/column basics
Given matrix
M = [[2, 4, 6],
[1, 3, 5],
[0, 7, 8]]
a) Extract the second row and third column.
b) Compute Mᵀ and show the element at position (1,2) in the transpose.
Answer
a) second row = [1,3,5], third column = [6,5,8].
b) transpose:
Mᵀ = [[2,1,0],
[4,3,7],
[6,5,8]]
Element at (1,2) (0-based indexing) = Mᵀ[1,2] = 7.
Why: Transpose swaps indices: Mᵀ[i,j] = M[j,i].
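The answers above can be checked directly in NumPy:

```python
import numpy as np

M = np.array([[2, 4, 6],
              [1, 3, 5],
              [0, 7, 8]])

print(M[1, :])    # second row -> [1 3 5]
print(M[:, 2])    # third column -> [6 5 8]
print(M.T[1, 2])  # element (1,2) of the transpose -> 7, which is M[2,1]
```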
Exercise 2 — DataFrame filtering & aggregation
Using the df from 3.2, compute:
average score for each age_group
list of ids for students with score > 80
Solution (pandas)
```python
df.groupby('age_group')['score'].mean()    # average score per age_group
df.loc[df['score'] > 80, 'id'].tolist()    # ids of students with score > 80
```
groupby collapses row sets into aggregated rows keyed by the column; loc filters rows then projects id column.
Exercise 3 — SQL row vs column (conceptual)
Which is typically faster: selecting a single column (projection) or full row scan when using a columnar-store database? Why?
In a columnar store, selecting a single column is faster because the engine reads only the column’s storage. In a row-store, reading many columns from many rows is efficient when you need full rows; reading a single column still requires scanning row structures unless there's a projection/columnar layout or index.
5. Hands-on labs
5.1 Row operations and solving linear systems (Gaussian elimination)
Solve Ax = b by row-reduction and verify with numpy.linalg.solve.
Materials: Python, numpy.
Procedure
Create a random invertible 3×3 matrix A and vector b.
Use Gaussian elimination (implement partial pivoting) to convert A|b to row-echelon form and back-substitute to get x.
Compare with numpy.linalg.solve(A,b).
Expected result: The two x vectors should be numerically close (within floating-point tolerance).
Gaussian elimination relies on row operations (swap, scale, add) to eliminate variables. Partial pivoting reduces numerical error.
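The procedure above can be sketched as follows (a minimal implementation; the function and variable names are my own, and the random matrix is made diagonally dominant so it is safely invertible):

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b via Gaussian elimination with partial pivoting."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    # Forward elimination with partial pivoting
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))  # pivot row: largest |entry| in column k
        if p != k:                           # row swap (a row operation)
            A[[k, p]] = A[[p, k]]
            b[[k, p]] = b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]            # multiplier
            A[i, k:] -= m * A[k, k:]         # add scalar multiple of pivot row
            b[i] -= m * b[k]
    # Back substitution on the row-echelon system
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.random((3, 3)) + 3 * np.eye(3)   # diagonally dominant -> invertible
b = rng.random(3)
x = gauss_solve(A, b)
print(np.allclose(x, np.linalg.solve(A, b)))  # True: matches the library solver
```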
5.2 Column operations and PCA (feature axes)
Use columns as features to compute PCA (principal components) and show how columns are rotated (linear combination of original columns).
Materials: Python, numpy / scikit-learn (or implement SVD).
Procedure
Load or simulate a dataset with 4 columns (features).
Standardize each column (column-wise mean subtraction, divide by std).
Compute covariance matrix (columns determine covariance matrix).
Compute eigenvectors (principal axes) — these are linear combos of original columns.
Project rows (observations) onto the first two principal components and plot.
Expected result: The PCA axes align with directions of greatest column variance; projecting rows shows clusters if present.
PCA treats columns as coordinate axes; operations are performed on column covariance.
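One way to sketch this lab with plain NumPy (simulated data; the dataset and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))   # 200 rows (observations), 4 columns (features)
X[:, 1] += 2 * X[:, 0]          # correlate two columns so PCA has something to find

# Standardize each column (column-wise mean subtraction, divide by std)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the columns (4x4)
C = np.cov(Z, rowvar=False)

# Eigen-decomposition: eigenvectors are the principal axes,
# i.e. linear combinations of the original columns
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]     # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project rows (observations) onto the first two principal components
scores = Z @ eigvecs[:, :2]
print(scores.shape)   # (200, 2) -- ready to plot
```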
5.3 Data query performance (SQL index)
See how adding an index on a column speeds up row lookups.
Materials: SQLite or PostgreSQL, table with ~100k rows.
Procedure
Create table users(id serial primary key, email text, age integer).
Insert 100k synthetic rows.
Run timed query: SELECT * FROM users WHERE email = 'some@example.com' — record time.
Create index on email: CREATE INDEX idx_email ON users(email);
Repeat query and compare times.
Expected result: Query time after index will be much smaller for selective lookups.
Indexes allow locating rows by column value without scanning the entire table.
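The same experiment can be run end-to-end with SQLite from Python (an in-memory sketch; exact timings depend on your machine, and a server database like PostgreSQL will show the same qualitative effect):

```python
import sqlite3
import time

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (email, age) VALUES (?, ?)",
                 ((f'user{i}@example.com', i % 80) for i in range(100_000)))
conn.commit()

def timed_lookup():
    t0 = time.perf_counter()
    row = conn.execute("SELECT * FROM users WHERE email = ?",
                       ('user99999@example.com',)).fetchone()
    return row, time.perf_counter() - t0

row_scan, t_scan = timed_lookup()      # full table scan
conn.execute("CREATE INDEX idx_email ON users(email)")
row_index, t_index = timed_lookup()    # B-tree lookup via the index
print(f"scan: {t_scan:.5f}s  indexed: {t_index:.6f}s")
```

The indexed lookup should be orders of magnitude faster for a selective predicate like this one.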
6. Principles and pitfalls
6.1 Row-major vs column-major memory layout
Row-major (C, numpy default): consecutive elements of a row are contiguous in memory.
Column-major (Fortran, R by default): consecutive elements of a column are contiguous.
Implication: Access patterns matching memory layout are faster due to cache locality. For example, iterating across columns inside a loop for a row-major array is slower.
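A quick way to see the effect (absolute timings vary by machine; only the relative gap matters):

```python
import time
import numpy as np

A = np.random.rand(3000, 3000)  # NumPy default: row-major (C order)

def sum_rowwise(M):
    # Iterate rows: each M[i, :] is contiguous in memory (cache-friendly)
    return sum(float(M[i, :].sum()) for i in range(M.shape[0]))

def sum_colwise(M):
    # Iterate columns: each M[:, j] strides across memory (cache-unfriendly)
    return sum(float(M[:, j].sum()) for j in range(M.shape[1]))

for f in (sum_rowwise, sum_colwise):
    t0 = time.perf_counter()
    f(A)
    print(f.__name__, f"{time.perf_counter() - t0:.3f}s")
```

On a C-ordered array the row-wise loop is typically noticeably faster, even though both compute the same total.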
6.2 Mutability & views vs copies
In libraries like NumPy, selecting a column may return a view (shared memory) or a copy depending on operation; be aware when mutating.
Pandas selections sometimes return views and sometimes copies; use .loc for assignment to avoid SettingWithCopyWarning.
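The NumPy side of this distinction in a few lines (basic slicing returns a view; boolean/fancy indexing returns a copy):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

view = A[:, 0]            # basic slice -> view, shares A's memory
view[0] = 99
print(A[0, 0])            # 99 -- the mutation is visible through the view

copy = A[A[:, 0] > 0, 0]  # boolean indexing -> copy, independent memory
copy[:] = -1
print(A[:, 0])            # [99 3] -- unchanged by mutating the copy
```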
6.3 Wide vs long (rows vs columns) data shapes
Wide: many columns (features). Pros: fast to index by row, easier to compute per-row operations. Cons: can be sparse, memory heavy.
Long: fewer columns, more rows (observations). Pros: easier to group, plot, and perform statistical operations. Many libraries prefer a tidy (long) format.
6.4 Database normalization (columns design)
Columns should be atomic and represent single attributes (1NF). Avoid repeating groups of columns (e.g., phone1, phone2) — use child tables (rows in another table).
Normalization reduces redundancy but can increase join complexity.
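The phone-number case above, sketched with SQLite (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Instead of repeating columns (phone1, phone2, ...), store phones as
# rows of a child table keyed by person_id.
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phones (
        person_id INTEGER REFERENCES people(id),
        phone TEXT
    );
    INSERT INTO people VALUES (1, 'Alice');
    INSERT INTO phones VALUES (1, '555-0101'), (1, '555-0102');
""")
# Any number of phones per person, with no empty phoneN columns --
# but reading them back now requires a join:
rows = conn.execute("""
    SELECT p.name, ph.phone
    FROM people p JOIN phones ph ON ph.person_id = p.id
    ORDER BY ph.phone
""").fetchall()
print(rows)   # [('Alice', '555-0101'), ('Alice', '555-0102')]
```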
7. Exercises
Matrix rank problem: given a 3×3 matrix, determine whether its rows are linearly independent using row reduction.
Pandas reshape: Convert a wide table of quarterly sales columns Q1, Q2, Q3, Q4 into long format with columns quarter and sales. (Implement with melt.)
SQL aggregation: query to return top 3 categories by average rating from a products table.
Performance profiling: Measure time to compute column means of a 2D NumPy array using loops vs vectorized operations. Report results and differences.
Each solution should include code and a short explanation of why the results make sense.
8. Research
Relational model — foundational concept of rows as tuples and columns as attributes (see original works on relational algebra / Codd for deep theoretical background).
Linear Algebra — for row-space/column-space/rank, use a standard linear algebra textbook (e.g., Gilbert Strang’s introductions are approachable).
Database internals — chapters on indexing, row/column stores, and storage engines give insight into performance tradeoffs.
DataFrame internals — read pandas documentation on indexing, view vs copy semantics, and chunked memory layout.
Numerical computing — papers/articles on cache locality, row-major vs column-major, and vectorization for performance.
9. Checklist
Prefer vectorized operations (NumPy/Pandas) over explicit Python loops; vectorized code runs in optimized compiled loops and is much faster.

Mind memory layout: when doing heavy numeric work, choose array layout to match your access pattern (row-major vs column-major).
Indexing for DBs: create indexes on columns used in WHERE/JOIN/GROUP BY frequently; avoid excessive indexing.
When to use row vs column stores:
OLTP (transactional): row-store is typical (read/write whole records).
OLAP (analytical): column-store excels (aggregations over columns).
Be explicit about dtypes in pandas and SQL to save memory (e.g., use category dtype for low-cardinality columns).
Avoid nullable or mixed-type columns when possible — they reduce performance and complicate downstream operations.
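The category-dtype tip above is easy to verify (the savings scale with how few distinct values the column has):

```python
import pandas as pd

# Low-cardinality text column: 300k rows, only 3 distinct values
s = pd.Series(['red', 'green', 'blue'] * 100_000)
cat = s.astype('category')

print(s.memory_usage(deep=True))    # object dtype: one Python string per row
print(cat.memory_usage(deep=True))  # small integer codes + 3 labels: far smaller
```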
10. Project (integrative)
Build a small ETL pipeline that loads CSV sensor data, cleans it, reshapes wide→long, computes per-sensor statistics (column-wise), and stores normalized results in a SQL table.
Steps:
Read the CSV into a DataFrame.
Inspect columns and rows (df.shape, df.head()).
Clean: drop rows with too many NaNs (row-level filtering); fill missing values per-column with column-wise strategy.
Reshape with melt if the sensor columns are wide (one column per sensor).
Compute per-sensor aggregates (mean, std, min, max) — column-wise.
Create SQL tables and insert aggregated rows (each row = sensor stats).
Create an index on sensor_id.
This project exercises row filtering, column aggregation, reshaping (rows↔columns), and storing aggregated rows in a database.
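A compact sketch of the pipeline (the data is simulated in place of the CSV, and the column names, NaN threshold, and fill strategy are assumptions):

```python
import sqlite3

import numpy as np
import pandas as pd

# Simulated "wide" sensor data: one column per sensor.
# In the real pipeline: df = pd.read_csv('sensors.csv', parse_dates=['timestamp'])
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00',
                                 '2024-01-01 02:00', '2024-01-01 03:00',
                                 '2024-01-01 04:00', '2024-01-01 05:00']),
    'sensor_a': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    'sensor_b': [10.0, np.nan, np.nan, 13.0, 14.0, 15.0],
})

# Clean: drop rows with too many NaNs, then fill per-column with the column mean
df = df.dropna(thresh=2)                 # keep rows with >= 2 non-null values
value_cols = ['sensor_a', 'sensor_b']
df[value_cols] = df[value_cols].fillna(df[value_cols].mean())

# Reshape wide -> long: one row per (timestamp, sensor) reading
long = df.melt(id_vars='timestamp', value_vars=value_cols,
               var_name='sensor_id', value_name='reading')

# Column-wise aggregates per sensor
stats = (long.groupby('sensor_id')['reading']
             .agg(['mean', 'std', 'min', 'max'])
             .reset_index())

# Store aggregated rows in SQL (each row = one sensor's stats) and index sensor_id
conn = sqlite3.connect(':memory:')
stats.to_sql('sensor_stats', conn, index=False)
conn.execute("CREATE INDEX idx_sensor ON sensor_stats(sensor_id)")
print(conn.execute("SELECT * FROM sensor_stats").fetchall())
```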
11. Quick reference
Access row i in NumPy: A[i, :]. Column j: A[:, j].
Access row/col in pandas: df.iloc[i] (row), df['col'] (column).
SQL: SELECT col FROM table WHERE condition (columns = projection, WHERE filters rows).
Transpose swaps rows & columns: A.T.




