<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jayadeep19.github.io/til-notes/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jayadeep19.github.io/til-notes/" rel="alternate" type="text/html" /><updated>2026-02-23T22:54:06+00:00</updated><id>https://jayadeep19.github.io/til-notes/feed.xml</id><title type="html">TIL-notes</title><subtitle>Today I learned</subtitle><author><name>Jayadeep</name></author><entry><title type="html">Statistics: Data and Sampling Distributions</title><link href="https://jayadeep19.github.io/til-notes/data-sampling-distribution" rel="alternate" type="text/html" title="Statistics: Data and Sampling Distributions" /><published>2026-02-03T00:00:00+00:00</published><updated>2026-02-03T00:00:00+00:00</updated><id>https://jayadeep19.github.io/til-notes/data-sampling-distribution</id><content type="html" xml:base="https://jayadeep19.github.io/til-notes/data-sampling-distribution"><![CDATA[<p>Today I learned about sampling and sampling distributions. When there is a huge dataset with an unknown distribution, we do not have a good idea of what patterns to expect in the data. We then take samples (subsets of the dataset) and make inferences about the true dataset. The true dataset is called the Population, and a sample of a population can be represented by $N(n)$.</p>
<ul>
  <li>Sometimes quality is better than quantity: a carefully drawn smaller sample can describe the data better than the full population, with all its errors and gaps. This is one of the reasons why good sampling is required.</li>
</ul>

<p>Today’s topics:</p>
<ul id="markdown-toc">
  <li><a href="#sampling" id="markdown-toc-sampling">Sampling</a></li>
  <li><a href="#sampling-distribution-of-a-statistic" id="markdown-toc-sampling-distribution-of-a-statistic">Sampling distribution of a statistic:</a>    <ul>
      <li><a href="#central-limit-theorem" id="markdown-toc-central-limit-theorem">Central Limit Theorem:</a></li>
      <li><a href="#standard-error" id="markdown-toc-standard-error">Standard Error:</a></li>
    </ul>
  </li>
</ul>

<h2 id="sampling">Sampling</h2>
<ul>
  <li>Bias:
    <ul>
      <li>Statistical bias refers to systematic errors introduced either during measurement or during sampling.</li>
      <li>It is important to differentiate between these two sources of bias. Several methods have been developed to reduce sampling bias; one of them is random sampling.</li>
    </ul>
  </li>
  <li>The most basic sampling is <strong>Random sampling</strong>: we draw members of the population into the sample at random, so every member of the population has an equal chance of being selected. When drawing the next member, we can either return the previously picked member to the population (sampling <em>with replacement</em>) or leave it out of the population for subsequent draws (<em>without replacement</em>).
To achieve a better representation of the population through sampling, it can be useful to ask questions like:
    <ul>
      <li>Do we need to stratify the data into smaller subgroups with similar properties before sampling?</li>
      <li>Would assigning weights to the different stratified subsets give a better-balanced sample?</li>
    </ul>
  </li>
  <li><strong>Sample Means vs Population Means</strong>:
    <ul>
      <li>The mean of the sample ($\bar{x}$) is often different from the mean of the population ($\mu$). This is important because the variation of the sample means across different samples can give a lot of information about the population.</li>
    </ul>
  </li>
  <li>A normal procedure to model a dataset is:
    <ul>
      <li>To specify a hypothesis (more about this later)</li>
      <li>Conduct a well-designed experiment to test the hypothesis.</li>
      <li>Analyse the results.</li>
    </ul>
  </li>
  <li>But this is not what happens in general. The person modeling the dataset may go on an extensive search through the data, and sooner or later a pattern turns up. But the question is: is this a real pattern, or just an artifact of <em>data snooping</em>? Data snooping occurs when the data is hunted through extensively until something interesting appears. This leads to</li>
  <li><strong>Selection Bias</strong>:
    <ul>
      <li>When a data scientist or statistician chooses samples selectively, in a way that drives the analysis toward a particular conclusion, it is called selection bias.</li>
    </ul>
  </li>
</ul>
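<p>A minimal sketch of simple random sampling with and without replacement, assuming NumPy is available; the population values and sample sizes here are made up for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "population" of 10,000 values (e.g. heights in cm).
population = rng.normal(loc=170, scale=8, size=10_000)

# Without replacement: a member can be drawn at most once.
sample_wo = rng.choice(population, size=100, replace=False)

# With replacement: a member may be drawn again.
sample_w = rng.choice(population, size=100, replace=True)

print(population.mean(), sample_wo.mean(), sample_w.mean())
```

<p>Every member has the same chance of selection either way; the two schemes differ only in whether a drawn member goes back into the pool.</p>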

<h2 id="sampling-distribution-of-a-statistic">Sampling distribution of a statistic:</h2>
<ul>
  <li>As we discussed above, the distribution of the sample mean is important for inferring information about the population. So in this section we look closer at sampling distributions.</li>
  <li>A sample statistic is a metric calculated for a sample.</li>
  <li>A sampling distribution is the frequency distribution of a sample statistic over many samples, while the data distribution is the frequency distribution of the individual data points.</li>
  <li>We can now try to estimate this sampling distribution to see how far a sample statistic (e.g. the mean) typically lands from the population statistic.</li>
  <li>About Sampling Distribution:
    <ul>
      <li>The sampling distribution of the mean resembles a bell curve.</li>
      <li>This statement is governed by the <strong>Central Limit Theorem (CLT)</strong>.</li>
    </ul>
  </li>
</ul>
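<p>The distinction between the data distribution and the sampling distribution shows up in a quick simulation (a sketch assuming NumPy; the skewed population is an arbitrary choice). The means computed over many samples cluster far more tightly than the raw data points do:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)  # the data distribution

n = 50
# Sampling distribution of the mean: one mean per sample, over 2,000 samples.
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(2_000)])

print(population.std(), sample_means.std())  # the means vary much less
```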

<h3 id="central-limit-theorem">Central Limit Theorem:</h3>
<ul>
  <li>It states that the means drawn from several samples will resemble a bell curve, given that:
    <ul>
      <li>The sample size is large enough.</li>
      <li>The data does not depart too much from normality.</li>
    </ul>
  </li>
  <li>For a data scientist the CLT is not so central, but for traditional statistics it is a very important theorem, as it lays the foundations for hypothesis testing and confidence intervals.</li>
  <li>A better method called the bootstrap is available for estimating the sampling distribution; it does not assume that the sampling distribution forms a bell curve. (more on the bootstrap later)</li>
</ul>
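<p>A small demonstration of the CLT, assuming NumPy; the exponential population and the sizes are my own toy choices. Even though the underlying data is strongly right-skewed, the distribution of sample means is nearly symmetric:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    # Third standardized moment: 0 for a symmetric distribution.
    x = np.asarray(x)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

population = rng.exponential(scale=1.0, size=200_000)           # skewed data
sample_means = rng.exponential(scale=1.0, size=(5_000, 40)).mean(axis=1)

print(skewness(population), skewness(sample_means))
```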

<h3 id="standard-error">Standard Error:</h3>
<ul>
  <li>Population: You have a big, unknown group.</li>
  <li>Samples/Resamples: You take many samples of size $n$.</li>
  <li>Sample Metric: You calculate the mean ($\bar{x}$) for every single sample.</li>
  <li>Sampling Distribution: You plot all those means. Because of the CLT, they form a bell curve</li>
  <li>Now we can calculate the standard deviation of this bell curve to find the spread of the curve or the variability of the sample means</li>
  <li>But we do not want to take thousands of samples just to estimate the sampling distribution. Instead, we can take a single sample and estimate the ‘Standard Error’ from it, which tells us how the statistic would vary across repeated samples.</li>
  <li>Because the shape of the curve is a bell curve, we need only two inputs to specify the normal distribution: the mean and the standard deviation. The standard deviation of the sampling distribution is what is called the ‘Standard Error’.</li>
  <li>The Standard Error is a single metric that measures the variability of the sampling distribution.</li>
  <li>It is given by $SE = \frac{s}{\sqrt{n}}$ ; s = <a href="/statistics1#Estimates of Variability:">standard deviation</a></li>
  <li>From the above equation, we can deduce that:
    <ul>
      <li>As the sample size ($n$) increases, the standard error decreases.</li>
      <li>This relationship between SE and sample size is referred to as the ‘square root of n’ rule.</li>
      <li>For the SE to go down by a factor of 2, the sample size has to be increased 4 times.</li>
      <li><img src="/til-notes/assets/img/statistics/Standard_error.png" alt="standard-error" /></li>
    </ul>
  </li>
  <li>While this method is useful for estimating the SE, it has several cons:
    <ul>
      <li>It assumes a bell curve for the sampling distribution.</li>
      <li>The standard deviation of the sample is used to estimate the SE; if the sample happens by chance to be unusual, this can throw off the SE estimate.</li>
      <li>The formula in this form only works for the mean.</li>
    </ul>
  </li>
  <li>Taking many samples just to estimate the SE is also statistically inefficient (wasteful of data). A more effective alternative is the ‘bootstrap’ method of resampling.</li>
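  <li>As a sanity check of the formula, we can compare $\frac{s}{\sqrt{n}}$ computed from a single sample against the brute-force standard deviation of many sample means (a sketch assuming NumPy; the population parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.normal(100, 15, size=50_000)
n = 100

# One sample: estimate the SE with the formula s / sqrt(n).
sample = rng.choice(population, size=n, replace=False)
se_formula = sample.std(ddof=1) / np.sqrt(n)

# Brute force: the standard deviation of 2,000 sample means.
means = rng.choice(population, size=(2_000, n)).mean(axis=1)
se_empirical = means.std()

print(se_formula, se_empirical)  # both near 15 / sqrt(100) = 1.5
```
  </li>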
</ul>]]></content><author><name>Jayadeep</name></author><category term="math" /><category term="datascience" /><category term="documentation" /><category term="Feb26" /><summary type="html"><![CDATA[Today I learned about Sampling and their distributions. When There is huge dataset with unknown distribution, we do not have a good idea what to expect in the patterns in the data. We then take some samples(a subset of the dataset) and make inferences about the true dataset. The true dataset is called as Population and the sample of a population can be represented by $N(n)$. Sometimes quality is better than quantity. Smaller samples give better description about the data that the population itself. This is one of the reason why a better sampling is required.]]></summary></entry><entry><title type="html">Statistics: Exploratory data analysis</title><link href="https://jayadeep19.github.io/til-notes/statistics1" rel="alternate" type="text/html" title="Statistics: Exploratory data analysis" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://jayadeep19.github.io/til-notes/statistics1</id><content type="html" xml:base="https://jayadeep19.github.io/til-notes/statistics1"><![CDATA[<ul>
  <li>I have started a new book this month, named ‘Practical Statistics for Data Scientists’.</li>
  <li>Today’s topics are mostly from that book. The first chapter is Exploratory Data Analysis. Although I know most of these concepts, I would still like to document the learnings for the sake of completeness.</li>
  <li>The first step in any project is analysing the raw data, filtering it, and structuring it in a way that is easier for the machine or algorithm to use for training a statistical model.</li>
  <li>The structured data could be either
    <ol>
      <li>Numeric: data expressed as numbers
        <ul>
          <li>Continuous: can take any value in an interval</li>
          <li>Discrete: can take only countable values</li>
        </ul>
      </li>
      <li>Categorical: the data takes one of a fixed set of values (e.g. types of TV screens: LCD: 1; plasma: 2, ..)
        <ul>
          <li>Binary: 0 or 1</li>
          <li>Ordinal: ordered categories.</li>
        </ul>
      </li>
    </ol>
  </li>
  <li>Rectangular data:
    <ul>
      <li>For analysis, data is mostly organised in a rectangular frame (a 2D matrix) in most software, such as spreadsheets or databases.</li>
      <li>The columns are called features; they are used to predict a value called the target variable.
        <ul>
          <li>To predict the weather, we could use features like humidity, wind speed, sunshine, etc.</li>
        </ul>
      </li>
      <li>The rows are the records or observations.</li>
      <li>Data can also be non-rectangular, like time series or graph data structures. This blog focuses on rectangular data.</li>
    </ul>
  </li>
</ul>

<p>Today’s topics:</p>
<ul id="markdown-toc">
  <li><a href="#estimates-of-location" id="markdown-toc-estimates-of-location">Estimates of Location:</a></li>
  <li><a href="#estimates-of-variability" id="markdown-toc-estimates-of-variability">Estimates of Variability:</a></li>
  <li><a href="#exploring-the-data-distribution" id="markdown-toc-exploring-the-data-distribution">Exploring the Data Distribution:</a></li>
  <li><a href="#correlation" id="markdown-toc-correlation">Correlation:</a></li>
  <li><a href="#exploring-two-or-more-variables" id="markdown-toc-exploring-two-or-more-variables">Exploring two or more variables:</a></li>
</ul>

<h2 id="estimates-of-location">Estimates of Location:</h2>
<ul>
  <li>When there are thousands of observations for a feature, it can be a good start for the analysis to know where most of the observations lie. For example: most of the observations for humidity are around 25.</li>
  <li>The metrics used to estimate the location of a feature are as follows:</li>
  <li><strong>Mean</strong>:
    <ul>
      <li>The average value of all the observations</li>
      <li>$mean = \bar{x} = \frac{\sum_{i=1}^n{x_i}}{n}$</li>
      <li>Note: the mean is very sensitive to outliers (extreme values in the observations), so there are other, more robust metrics for estimating location.</li>
      <li><strong>Trimmed Mean</strong>:
        <ul>
          <li>Before calculating the mean, the extreme values are trimmed/dropped.</li>
          <li>Instead of averaging all ‘n’ observations, we drop the ‘p’ largest and ‘p’ smallest values and average the remaining $n-2p$.</li>
          <li>This essentially reduces the sensitivity to extreme values.</li>
        </ul>
      </li>
      <li><strong>Weighted Mean</strong>:
        <ul>
          <li>We can multiply each datapoint ($x_i$) by a specific weight to tweak that datapoint’s individual influence on the final value.</li>
          <li>$\bar{x}_w = \frac{\sum_{i=1}^n{w_ix_i}}{\sum_{i=1}^n{w_i}}$</li>
          <li>This method is useful when the proportions of observations in two categories are not similar. We can assign a higher weight to the group with fewer observations, which reduces the bias towards the group with more data points.</li>
        </ul>
      </li>
      <li><strong>Median</strong>:
        <ul>
          <li>The middle value of the sorted data is called the median.</li>
          <li>When there is an even number of datapoints, we take the average of the two middle values.</li>
          <li>The median is robust to outliers.</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
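<p>The location estimates above can be sketched with NumPy (assumed available; the toy data is made up). Note how the median, the trimmed mean and the weighted mean shrug off the outlier that drags the plain mean upward:</p>

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100.0])  # 100 is an outlier

mean = data.mean()
median = np.median(data)

# Trimmed mean: drop the p largest and p smallest values, then average.
p = 1
trimmed = np.sort(data)[p:-p].mean()

# Weighted mean: down-weight the outlier (weights chosen by hand here).
w = np.where(data > 50, 0.1, 1.0)
weighted = (w * data).sum() / w.sum()

print(mean, median, trimmed, weighted)
```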

<blockquote>
  <p>Outliers are sometimes informative and sometimes a nuisance. Anomaly detection is used to identify outliers; I will get to this in a later blog.</p>
</blockquote>

<h2 id="estimates-of-variability">Estimates of Variability:</h2>
<ul>
  <li>In the next step of exploring the data, one might be interested in finding how spread out the data is.</li>
  <li>The various metrics for measuring variability are:
    <ul>
      <li><strong>Deviation</strong>:
        <ul>
          <li>The difference between Observed and the estimate of location.</li>
          <li>Taking the mean of the absolute values of these deviations (i.e. ignoring their signs) from the mean gives the mean absolute deviation.</li>
          <li>$Mean\ absolute\ deviation = \frac{\sum{|x_i-\bar{x}|}}{n}$</li>
        </ul>
      </li>
      <li><strong>Variance</strong>:
        <ul>
          <li>Variance is the average of the squared deviations.</li>
          <li>Closely related to the variance is the important metric <strong>standard deviation</strong>, the square root of the variance.</li>
          <li>$Variance = s^2 = \frac{\sum{(x_i-\bar{x})^2}}{n-1}$</li>
          <li>The denominator is $n-1$ rather than $n$ because of ‘degrees of freedom’: the sample mean $\bar{x}$ has already been estimated from the same data, which uses up one degree of freedom, and dividing by $n-1$ corrects the bias this introduces. When $n$ is large, the difference is negligible.</li>
          <li>Then the $Standard\ Deviation$ is given by $s = \sqrt{variance}$</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>The standard deviation is the most commonly used of these metrics because it is in the same units as the data.
    <ul>
      <li>These metrics are not robust to outliers. A more robust metric is the ‘median absolute deviation from the median’.
        <ul>
          <li>$MAD = median(|x_1-m|, |x_2-m|, \ldots, |x_n-m|)$</li>
        </ul>
      </li>
      <li>Percentile estimates:
        <ul>
          <li>A percentile is a value below which a certain percentage of the data falls. The $p^{th}$ percentile is a value such that at least $p$ percent of the observations are less than or equal to it and at least $(100-p)$ percent are greater than or equal to it.</li>
          <li>For example, being at the 90th percentile for height means that I am taller than 90% of the people in the dataset.</li>
          <li>This is where the concept of the <em>Interquartile Range (IQR)</em> comes in. The dataset is divided into 4 quartiles:</li>
          <li>1st quartile: 25th percentile</li>
          <li>2nd quartile: 50th percentile</li>
          <li>3rd quartile: 75th percentile</li>
          <li>4th quartile: 100th percentile</li>
          <li>The IQR is then $Q_3-Q_1$. Essentially, we consider the middle 50% of the dataset.</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
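<p>The variability estimates above can be computed in a few lines (a sketch assuming NumPy; the toy data is made up):</p>

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

variance = data.var(ddof=1)  # divides by n-1 (degrees of freedom)
std = data.std(ddof=1)

# Median absolute deviation from the median: robust to outliers.
m = np.median(data)
mad = np.median(np.abs(data - m))

# Interquartile range: Q3 - Q1, the middle 50% of the data.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(variance, std, mad, iqr)
```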

<h2 id="exploring-the-data-distribution">Exploring the Data Distribution:</h2>
<ul>
  <li>Instead of summarizing the data into single numbers, we can make some initial sense of the data by seeing visually how it is distributed.</li>
  <li>Statisticians have developed several plots to visualise the spread of the data.</li>
  <li>Box plots:
    <ul>
      <li>These are based on the percentile we discussed above.</li>
      <li>The plot indicates outliers, and the 4 quartiles along with the IQR of the data.</li>
      <li><img src="/til-notes/assets/img/statistics/boxplot.png" alt="boxplot" /></li>
    </ul>
  </li>
  <li>Frequency tables, histograms and density plots:
    <ul>
      <li>The data is classified into bins and the count in each bin is called the frequency of the bin.</li>
      <li>We can use a histogram to visualise the data.</li>
      <li>For a density plot we can use a ‘kernel density estimate’ to smoothen the histogram
  <img src="/til-notes/assets/img/statistics/histogram.png" alt="boxplot" /></li>
    </ul>
  </li>
</ul>
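<p>Binning data into a frequency table is a one-liner with <code>np.histogram</code> (NumPy assumed; the data and the bin choice are arbitrary):</p>

```python
import numpy as np

data = np.array([1, 2, 2, 3, 5, 6, 6, 6, 8, 9])

# Three equal-width bins over [0, 9]; counts[i] is the frequency of bin i.
# (The last bin also includes its right edge.)
counts, edges = np.histogram(data, bins=3, range=(0, 9))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {c}")
```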

<h2 id="correlation">Correlation:</h2>
<ul>
  <li>In exploratory data analysis, one of the important steps is to check for correlation.</li>
  <li>Correlation measures the relation between two predictors, or between a predictor and the target.</li>
  <li>Say we have two features x and y: how does y change when x changes? This relationship is what correlation measures.</li>
  <li>The correlation coefficient is the metric used to measure the correlation between two features. Its value lies between -1 and 1: -1 indicates a perfect negative (inverse) relationship, 1 a perfect positive relationship, and values near 0 little linear relationship.
    <ul>
      <li>Pearson Correlation Coefficient: $\frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{(n-1)s_xs_y}$</li>
    </ul>
  </li>
  <li>When the relation between features is not linear, the correlation coefficient may not be a meaningful metric.</li>
  <li>The method above is also not robust to outliers.</li>
</ul>
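<p>Here is the Pearson coefficient computed directly from the formula, checked against NumPy’s built-in <code>np.corrcoef</code> (the toy data is made up; y is roughly $2x$, so r should be close to 1):</p>

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

n = len(x)
# Pearson correlation coefficient from the formula above.
r = ((x - x.mean()) * (y - y.mean())).sum() / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

print(r, np.corrcoef(x, y)[0, 1])
```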

<h2 id="exploring-two-or-more-variables">Exploring two or more variables:</h2>
<ul>
  <li>Some plots can be used to explore two or more variables simultaneously:
    <ol>
      <li>Contingency table</li>
      <li>Hexagonal binning</li>
      <li>Contour plots</li>
      <li>Violin plot.</li>
    </ol>
  </li>
</ul>]]></content><author><name>Jayadeep</name></author><category term="math" /><category term="datascience" /><category term="documentation" /><category term="Feb26" /><summary type="html"><![CDATA[I have started a new book this month that is named ‘Practical statistics for Data Scientists’. Todays topics are mostly from that book. The first chapter is Exploratoty data analysis. Although I know most of these concepts, I would still like to document the learnings for the sake of completness. The first steps for any project is analysing the raw data, filtering, and structure the data into a ‘more easy for the machine or algorithm’ way to train a statistical model. The structured data could be either Numeric: Data Expressed as Numericals Continuous : Discrete Categorical: The data takes a set of values (ex types of tv screens: Lcd:1; plasma:2,..) Binary: 0 or 1 Ordinal: Ordered categories. Rectangular data: For analysis, the data is mostly organised in a rectangular frame (2D matrix) of reference in most of the softwares like spread sheets ot databases. The columns are called as Features that are used to predict a value called target variable. To predict weather, we could used features like humidity, windspeed, sunny.. The rows are the records or observations. The data could be Non rectangular, like time series data, graph datastructures and so on.. In this blog rectangular data is focused.]]></summary></entry><entry><title type="html">Vector calculus</title><link href="https://jayadeep19.github.io/til-notes/vector-calculus" rel="alternate" type="text/html" title="Vector calculus" /><published>2026-01-23T00:00:00+00:00</published><updated>2026-01-23T00:00:00+00:00</updated><id>https://jayadeep19.github.io/til-notes/vector-calculus</id><content type="html" xml:base="https://jayadeep19.github.io/til-notes/vector-calculus"><![CDATA[<p>Todays topics:</p>
<ul id="markdown-toc">
  <li><a href="#chain-rule" id="markdown-toc-chain-rule">Chain rule:</a></li>
  <li><a href="#vector-fields-and-gradients" id="markdown-toc-vector-fields-and-gradients">Vector fields and Gradients:</a></li>
  <li><a href="#directional-derivatives" id="markdown-toc-directional-derivatives">Directional derivatives:</a></li>
  <li><a href="#jacobian-and-hessian-matrices" id="markdown-toc-jacobian-and-hessian-matrices">Jacobian and Hessian Matrices:</a>    <ul>
      <li><a href="#jacobian-matrix" id="markdown-toc-jacobian-matrix">Jacobian Matrix:</a></li>
      <li><a href="#hessian-matrix" id="markdown-toc-hessian-matrix">Hessian Matrix:</a></li>
    </ul>
  </li>
</ul>

<h2 id="chain-rule">Chain rule:</h2>
<ul>
  <li>Composite functions are functions composed of other functions, for example: $F(x) = f(g(x))$. The chain rule can be used to differentiate these types of functions.</li>
  <li>The above is the simple case: write the outer function as $y = f(x)$ with $x = g(t)$. Then we get $\frac{dy}{dt} = \frac{dy}{dx} \cdot \frac{dx}{dt}$</li>
  <li>There are other cases depending on the number of variables in the function.
    <ol>
      <li>Case 1: $z = f(x,y), x = g(t), y = h(t)$, we need to find the ratio $\frac{dz}{dt}$</li>
    </ol>
    <ul>
      <li>Since the connection between independent and dependent variable is direct, we can use the form similar to above</li>
      <li>$\frac{dz}{dt}=\frac{\partial{f}}{\partial{x}} \cdot\frac{dx}{dt} + \frac{\partial{f}}{\partial{y}} \cdot \frac{dy}{dt}$
        <ol>
          <li>Case 2: $z=f(x,y);\ x=g(s,t);\ y=h(s,t)$. Here the internal functions have two variables each, so the goal becomes finding the two ratios $\frac{\partial{z}}{\partial{s}} \&amp; \frac{\partial{z}}{\partial{t}}$</li>
        </ol>
      </li>
      <li>We can use a tree diagram to find all the components of differential function.</li>
      <li><img src="/til-notes/assets/img/calculus/chain_rule.png" alt="chainrule" /></li>
      <li>The first two branches are the intermediate variables inside the main function. In the second layer, the internal functions are further split into their own independent variables.</li>
      <li>At the end we add all the partial-derivative products along each path, which gives the final ratio.</li>
    </ul>
  </li>
</ul>
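<p>Case 1 can be checked numerically with nothing but the standard library; the functions $z = x^2y$, $x = \cos t$, $y = \sin t$ are my own toy choices. The chain-rule answer matches a finite-difference estimate:</p>

```python
import math

# z = f(x, y) = x**2 * y, with x = cos(t) and y = sin(t)
def z_of_t(t):
    return math.cos(t) ** 2 * math.sin(t)

def dz_dt(t):
    # dz/dt = (df/dx) * dx/dt + (df/dy) * dy/dt
    x, y = math.cos(t), math.sin(t)
    return 2 * x * y * (-math.sin(t)) + x ** 2 * math.cos(t)

t, h = 0.7, 1e-6
numeric = (z_of_t(t + h) - z_of_t(t - h)) / (2 * h)  # central difference
print(dz_dt(t), numeric)
```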

<h2 id="vector-fields-and-gradients">Vector fields and Gradients:</h2>
<ul>
  <li>A <strong>gradient</strong> is a fancy word for the derivative, or rate of change, of a function. It is a vector that points in the direction of the greatest (steepest) increase of the function.</li>
  <li>A <strong>vector field</strong> in 2D or 3D is a function $\overrightarrow{F}$ that assigns a 2D or 3D vector to each point in space. A vector field built from the gradient of a scalar function is called a ‘gradient vector field’.</li>
  <li>Scalar functions and vector functions: when a function outputs scalars it is a scalar function; when the output is a vector, it is a vector function.</li>
  <li>Now, for a scalar function $f(x,y,z)$ the gradient vector function becomes $\nabla{f} = &lt;f_x,f_y,f_z&gt;$. The terms inside the vector function are the <strong>partial derivatives of the function f wrt (x,y,z)</strong>. They are also called the vector components.
    <ul>
      <li>
\[\nabla{f}= \begin{bmatrix} f_x \\ f_y \\ f_z \end{bmatrix} = f_x\overrightarrow{i}+f_y\overrightarrow{j}+f_z\overrightarrow{k}\]
      </li>
    </ul>
  </li>
  <li>An example of vector field is the flow of fluid inside a pipe 
<img src="/til-notes/assets/img/calculus/vector_field.png" alt="vector field" /></li>
  <li>As an example, take the function $f(x,y) = x^2 \sin(5y)$. Then the gradient field becomes $\nabla{f} = &lt;2x\sin(5y), 5x^2\cos(5y)&gt;$.</li>
  <li>We can substitute x and y values into the vector function to get, at each point in space, the direction vector of the greatest change of the function.
<img src="/til-notes/assets/img/calculus/gradients_vectorfield.png" alt="grad-vecfield" /></li>
</ul>
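<p>We can verify the post’s own example $f(x,y) = x^2\sin(5y)$ numerically (standard library only; the evaluation point is arbitrary). The gradient components agree with finite-difference partial derivatives:</p>

```python
import math

def f(x, y):
    return x ** 2 * math.sin(5 * y)

def grad_f(x, y):
    # From the post: grad f = <2x sin(5y), 5x**2 cos(5y)>
    return (2 * x * math.sin(5 * y), 5 * x ** 2 * math.cos(5 * y))

x, y, h = 1.3, 0.4, 1e-6
gx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # numeric df/dx
gy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # numeric df/dy
print(grad_f(x, y), (gx, gy))
```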

<h2 id="directional-derivatives">Directional derivatives:</h2>
<ul>
  <li>A directional derivative is the rate of change of a function in a given direction. In other words, given a direction, the directional derivative tells us how much the function changes in that direction.</li>
  <li>In the previous section we saw that the maximum change of the function is along the gradient, so it is natural that the <strong>directional derivative is maximum when the chosen direction points along the gradient</strong>.</li>
  <li>We can prove this using dot product properties:
    <ul>
      <li>Given a function f(x,y) and a unit vector $\overrightarrow{u} = &lt;a,b&gt;$, the directional derivative is written as $D_{\overrightarrow{u}}f(x,y)$</li>
      <li>The directional derivative is given as  \(D_{\overrightarrow{u}}f(x,y) = f_x(x,y)a + f_y(x,y)b \implies &lt;f_x, f_y&gt;.&lt;a,b&gt; \iff \begin{bmatrix} f_x \\ f_y \end{bmatrix} \cdot \begin{bmatrix} a \\ b \end{bmatrix}\)</li>
      <li>The expression above boils down to the dot product between the gradient vector and the unit vector of the given direction.</li>
      <li>$D_{\overrightarrow{u}}f(x,y) = \nabla{f} \cdot \overrightarrow{u}$</li>
      <li>The maximum value of $D_{\overrightarrow{u}}f(x,y)$ occurs when the angle between the two vectors is 0, because $\cos(0)=1$; in other words, when the unit vector points along the gradient.</li>
    </ul>
  </li>
</ul>
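<p>A small check of the dot-product formula (standard library only; the function $f(x,y) = x^2 + 3y^2$ and the point are my own toy choices). The derivative along the gradient direction equals $|\nabla f|$ and beats any other direction:</p>

```python
import math

# f(x, y) = x**2 + 3*y**2, so grad f = <2x, 6y>
def grad(x, y):
    return (2 * x, 6 * y)

def directional(x, y, u):
    gx, gy = grad(x, y)
    return gx * u[0] + gy * u[1]  # D_u f = grad f . u

x, y = 1.0, 2.0
gx, gy = grad(x, y)
norm = math.hypot(gx, gy)
u_grad = (gx / norm, gy / norm)  # unit vector along the gradient

d_along_grad = directional(x, y, u_grad)  # equals |grad f|
d_along_x = directional(x, y, (1.0, 0.0))
print(d_along_grad, d_along_x)
```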

<h2 id="jacobian-and-hessian-matrices">Jacobian and Hessian Matrices:</h2>
<ul>
  <li>When a function has multiple independent variables, it can be hard to find the direction of the steepest change of the function.</li>
  <li>One has to look in all the directions (along all the variables). This is where the Jacobian and Hessian matrices come in.</li>
  <li>The Jacobian and Hessian matrices describe the ‘slope’ and ‘curvature’ of a multivariate function, respectively.</li>
</ul>

<h3 id="jacobian-matrix">Jacobian Matrix:</h3>
<ul>
  <li>The Jacobian is a matrix composed of the first-order partial derivatives of a multivariate vector function.</li>
  <li>For a function $f(x_1,x_2,..x_n) = (f_1, f_2,..f_m)$</li>
  <li>The Jacobian matrix has m rows (as many as the number of component functions) and n columns (as many as the number of variables):</li>
  <li>
\[J_f = \begin{bmatrix}  \frac{\partial{f_1}}{\partial{x_1}}&amp;\frac{\partial{f_1}}{\partial{x_2}}&amp;.&amp;.&amp;.&amp;\frac{\partial{f_1}}{\partial{x_n}}\\ \frac{\partial{f_2}}{\partial{x_1}}&amp;\frac{\partial{f_2}}{\partial{x_2}}&amp;.&amp;.&amp;.&amp;\frac{\partial{f_2}}{\partial{x_n}}\\
 .&amp;. \\
 .&amp;. \\
 .&amp;. \\
 \frac{\partial{f_m}}{\partial{x_1}}&amp;\frac{\partial{f_m}}{\partial{x_2}}&amp;.&amp;.&amp;.&amp;\frac{\partial{f_m}}{\partial{x_n}} \\
 \end{bmatrix}\]
  </li>
  <li>Uses of Jacobian matrix:
    <ul>
      <li>To approximate a complex multivariate vector function by a linear, flat plane around a point ‘P’. The shape of the function can be complex across dimensions; to find values near P at much lower computational cost, we approximate the function by a linear function around ‘P’.</li>
      <li>The determinant of the Jacobian matrix (often itself called ‘the Jacobian’) tells us whether the function is locally invertible:
        <ul>
          <li>$det(J)\ne0$: the function is <strong>locally invertible</strong></li>
          <li>$det(J)=0$: the function is <strong>not locally invertible</strong> at that point</li>
        </ul>
      </li>
      <li>Apart from these, the Jacobian matrix is helpful in determining <strong>critical points</strong>. Remember that when we set the first derivative to 0, we are essentially finding the points where the function has a maximum or minimum? It’s the same here: to find these critical points, we set $J_f(x,y) = 0$ and solve the resulting system of equations.</li>
      <li>These critical points are later classified as maxima or minima using the Hessian matrix.</li>
    </ul>
  </li>
</ul>
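<p>A classic concrete case (my own choice, not from the post): the polar-to-Cartesian map $f(r,\theta) = (r\cos\theta,\ r\sin\theta)$. Its Jacobian determinant works out to $r$, so the map is locally invertible everywhere except $r = 0$:</p>

```python
import math

# Jacobian of f(r, theta) = (r*cos(theta), r*sin(theta)):
# rows = component functions, columns = variables (r, theta).
def jacobian(r, theta):
    return [[math.cos(theta), -r * math.sin(theta)],
            [math.sin(theta),  r * math.cos(theta)]]

def det2(J):
    return J[0][0] * J[1][1] - J[0][1] * J[1][0]

r, theta = 2.0, math.pi / 6
print(det2(jacobian(r, theta)))  # equals r, here 2.0
```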

<h3 id="hessian-matrix">Hessian Matrix:</h3>
<ul>
  <li>Like the Jacobian matrix, the Hessian matrix is made of derivatives. However, the Hessian contains ‘second-order derivatives’.</li>
  <li>It gives the ‘curvature of the function’.</li>
  <li>Hessian Matrix is a symmetric matrix</li>
  <li>For $f(x,y)$, \(H = \begin{bmatrix}
      \frac{\partial^2 f}{\partial x^2} &amp; \frac{\partial^2 f}{\partial x \partial y} \\ 
      \frac{\partial^2 f}{\partial y \partial x} &amp; \frac{\partial^2 f}{\partial y^2}
      \end{bmatrix}\)</li>
  <li>It can be used to find whether a critical point belongs to a pit (minimum), a peak (maximum) or a mountain pass (saddle). For this we compute something called the discriminant (D):
    <ul>
      <li>Discriminant (D) = $f_{xx}f_{yy}-f_{xy}^2$</li>
      <li>If $D&gt;0$ and $f_{xx} &gt;0$, then P is a local minimum</li>
      <li>If $D&gt;0$ and $f_{xx} &lt;0$, then P is a local maximum</li>
      <li>If $D&lt;0$, then P is a saddle point. A saddle point is where one direction curves up and another direction curves down.</li>
    </ul>
  </li>
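  <li>The classification rules can be wrapped in a tiny helper (a sketch; the three test functions are standard textbook examples, not from the post):

```python
def classify(fxx, fyy, fxy):
    # Discriminant D = f_xx * f_yy - f_xy**2, evaluated at a critical point.
    D = fxx * fyy - fxy ** 2
    if D > 0:
        return "local minimum" if fxx > 0 else "local maximum"
    if D < 0:
        return "saddle point"
    return "inconclusive"

print(classify(2, 2, 0))    # x**2 + y**2 at the origin -> local minimum
print(classify(-2, -2, 0))  # -(x**2 + y**2) -> local maximum
print(classify(2, -2, 0))   # x**2 - y**2 -> saddle point
```
  </li>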
</ul>]]></content><author><name>Jayadeep</name></author><category term="math" /><category term="documentation" /><category term="vector calculus" /><category term="Jacobian" /><category term="Hessian" /><category term="Chain rule" /><category term="Gradients" /><category term="Directional Derivatives" /><category term="January26" /><summary type="html"><![CDATA[Todays topics:]]></summary></entry><entry><title type="html">Calculus basics</title><link href="https://jayadeep19.github.io/til-notes/calculus1" rel="alternate" type="text/html" title="Calculus basics" /><published>2026-01-20T00:00:00+00:00</published><updated>2026-01-20T00:00:00+00:00</updated><id>https://jayadeep19.github.io/til-notes/calculus1</id><content type="html" xml:base="https://jayadeep19.github.io/til-notes/calculus1"><![CDATA[<p>It would be nice to revisit calculus while I’m at revisiting the math backgrounds. So, todays blog is related to calculus
Today’s topics include:</p>
<ul id="markdown-toc">
  <li><a href="#differentials" id="markdown-toc-differentials">Differentials:</a></li>
  <li><a href="#geometric-meaning" id="markdown-toc-geometric-meaning">Geometric meaning:</a></li>
  <li><a href="#partial-fractions" id="markdown-toc-partial-fractions">Partial fractions:</a></li>
  <li><a href="#partial-differentiation" id="markdown-toc-partial-differentiation">Partial Differentiation:</a></li>
  <li><a href="#integration" id="markdown-toc-integration">Integration:</a></li>
</ul>

<p>Calculus is made up of mostly two symbols:</p>
<ol>
  <li><strong>“d”</strong> meaning <mark>'A little bit of'</mark>: when something says “dx” it means a little bit of x, usually an indefinitely small piece. This is called a <strong>Differential</strong>
    <ul>
      <li>Remember: when $dx$ is itself infinitely small, the higher-order terms (like $(dx)^2$) are smaller still. So in many calculations the higher-order terms are simply omitted to keep things simpler.</li>
    </ul>
  </li>
  <li><strong>”$\int$”</strong> meaning <mark>'the sum of'</mark>: when we see “$\int{dx}$” it means the sum of all the little pieces of ‘x’. This is called an <strong>Integral</strong></li>
</ol>

<h2 id="differentials">Differentials:</h2>
<ul>
  <li>Suppose a ladder is laid against a wall. Call the ladder’s horizontal reach ‘x’ and its vertical reach ‘y’.</li>
  <li>What happens to ‘x &amp; y’ when the ladder’s tilt is changed a little bit?</li>
  <li>Say the horizontal reach changes by a ‘little bit’ ‘dx’ and the vertical reach by ‘dy’.</li>
  <li>The change in vertical reach (dy) might be positive or negative based on the change in the horizontal reach(dx).
<img src="/til-notes/assets/img/calculus/ladder_calculus.png" alt="ladder" /></li>
  <li>The goal of differentiation is to find the ratio $\frac{dy}{dx}$ (<em>the differential coefficient of y with respect to x</em>): the change in vertical reach with respect to the change in horizontal reach!</li>
  <li>Functions of the type $l^2 = x^2+y^2$ are called <strong>implicit functions</strong>; they carry the dependency between the variables implicitly inside the function and are represented in the form $F(x,y)$.</li>
  <li>The other form of functions are <strong>explicit functions</strong>. For the ladder example, since the ladder forms a right-angled triangle, we can say that $y = \sqrt{l^2-x^2}$, in which the variable ‘y’ is called the dependent variable, depending on the independent variable ‘x’. We can write this in the form $y = F(x)$</li>
</ul>

<blockquote>
  <ul>
    <li>When a function $F(x)$ is differentiated, the result can be represented as $F'(x)$, equivalent to $\frac{d(F(x))}{dx}$.</li>
    <li>This is called the ‘derived function’</li>
  </ul>
</blockquote>

<ul>
  <li>A constant added to a function does not change its derivative. However, a multiplying constant multiplies the derivative
    <ul>
      <li>Example: $y = x+2$ and $y = 2x$</li>
    </ul>
  </li>
  <li>Sums and differences:
    <ul>
      <li>To find the differential coefficient of the sum of two functions of x, e.g. $y=(x^2+c)+(ax^4+b)$, simply differentiate the two terms one after the other. The answer is $\frac{dy}{dx}=2x+4ax^3$.</li>
      <li>The solution for $y = u+v$: $\frac{dy}{dx}=\frac{du}{dx}+\frac{dv}{dx}$</li>
    </ul>
  </li>
  <li>Products:
    <ul>
      <li>The product of two functions is not as straightforward, since, as seen in the constants section, a constant multiplying a function changes the final result. Eg: $y=(x^2+c)\cdot(ax^4+b)$</li>
      <li>The solution for $y = u\times v$ will be: $\frac{dy}{dx}= u \cdot \frac{dv}{dx}+ v \cdot \frac{du}{dx}$</li>
    </ul>
  </li>
  <li>Quotients:
    <ul>
      <li>For $y = \frac{u}{v}$, the solution will be: $\frac{dy}{dx}= \frac{(v \cdot\frac{du}{dx}−u \cdot \frac{dv}{dx})}{v^2}$</li>
    </ul>
  </li>
  <li>Successive Differentiation:
    <ul>
      <li>A function can be differentiated successively; the result of differentiating a second time is called the ‘second derived function’, and so on…</li>
    </ul>

    <blockquote>
      <p>They can be represented as before but with two dashes: $F''(x) = \frac{d}{dx}\left(\frac{dy}{dx}\right) = \frac{d^2y}{dx^2}$</p>
    </blockquote>

    <ul>
      <li>For example, when distance is differentiated with respect to time, we get velocity, which gives acceleration when differentiated again!! This differentiation with respect to time has a name: <strong>rate</strong>. The rate of change of something…</li>
      <li>Let ‘y’ be the distance; then the velocity is $\nu = \frac{dy}{dt}$ and the acceleration is $a = \frac{d}{dt}\left(\frac{dy}{dt}\right) = \frac{d^2y}{dt^2}$</li>
    </ul>
  </li>
</ul>
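<p>The sum and product rules above can be sanity-checked numerically. Below is a minimal Python sketch using a central-difference approximation; the particular functions u and v are hypothetical examples chosen for illustration:</p>

```python
# Numerical sanity check of the sum rule and the product rule.

def derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

u = lambda x: x**2 + 3.0        # du/dx = 2x
v = lambda x: 5 * x**4 + 1.0    # dv/dx = 20x^3

x = 1.5
# Sum rule: d(u+v)/dx = du/dx + dv/dx
lhs_sum = derivative(lambda t: u(t) + v(t), x)
rhs_sum = derivative(u, x) + derivative(v, x)
# Product rule: d(uv)/dx = u*dv/dx + v*du/dx
lhs_prod = derivative(lambda t: u(t) * v(t), x)
rhs_prod = u(x) * derivative(v, x) + v(x) * derivative(u, x)

print(abs(lhs_sum - rhs_sum) < 1e-4)    # True
print(abs(lhs_prod - rhs_prod) < 1e-3)  # True
```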

<h2 id="geometric-meaning">Geometric meaning:</h2>
<ul>
  <li>Well, what is the real importance of differentiation? We can explain this using a graph.
<img src="/til-notes/assets/img/calculus/geometric_meaning.png" alt="geometric-meaning" /></li>
  <li>Imagine the graph above shows a curve. Take a small piece of the curve: dx is the small change along the x-axis and dy the small change along the y-axis. When the observed piece of the curve is infinitely small, we can consider it a straight line, and the ratio $\frac{dy}{dx}$ can then be treated as the slope of the curve at that point.</li>
</ul>

<blockquote>
  <p>So the ratio $\frac{dy}{dx}$ gives the slope of the curve at a particular point in the curve.</p>
</blockquote>

<ul>
  <li>This value (differential coefficient) says a lot about how the curve behaves at that point of the curve.
    <ul>
      <li>A value of 1 means the curve is sloping at $45^\circ$</li>
      <li>A value greater than 1 means the curve has a slope of more than $45^\circ$</li>
      <li>When the value is negative, the curve slopes downwards.</li>
      <li>When the value is increasing, the slope of the curve is increasing</li>
      <li>Similarly, a decreasing slope means the curve is gradually approaching the horizontal (though it is not horizontal until the value reaches 0)</li>
    </ul>
  </li>
  <li>But an interesting case appears when this value becomes 0. The curve then has no slope, meaning it is horizontal. For a curve this happens in two notable cases: a <strong>Maximum and a Minimum</strong>.</li>
  <li>Maxima and Minima:
    <ul>
      <li>One of the main reasons to differentiate a function is to find out where (or whether) the curve reaches its maximum or minimum value. This is an important concept in engineering and AI, e.g. for maximizing the efficiency of a model.</li>
      <li>So, when modelling, the function can be differentiated and equated to zero; solving that equation gives the points where the maximum or minimum may lie.</li>
      <li>But one problem remains: we do not know whether such a point is a maximum or a minimum. The equation only tells us where the slope of the curve is 0</li>
      <li>This is where successive differentiation comes in: with the second derived function, we can determine whether the point is a maximum or a minimum! How??
        <ul>
          <li>The value $F''(x)$ gives the ‘curvature of the slope’</li>
          <li>When the slope ($F'(x)$) is constant, the value of $F''(x)$ is zero.</li>
          <li>When the slope is increasing, $F''(x)$ is positive</li>
          <li>When the slope is decreasing, $F''(x)$ is negative.</li>
        </ul>
      </li>
      <li>We can use this sign to determine whether the point is a maximum or a minimum.</li>
    </ul>

    <blockquote>
      <ul>
        <li>When the value $F''(x)$ is <mark>positive</mark>, we can say that the point at which $F'(x)=0$ is a <mark>Minimum</mark></li>
        <li>When the value $F''(x)$ is <mark>negative</mark>, we can say that the point at which $F'(x)=0$ is a <mark>Maximum</mark></li>
      </ul>
    </blockquote>
  </li>
</ul>
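<p>The second-derivative test above can be sketched numerically. A minimal Python example, assuming the hypothetical function $F(x) = x^3 - 3x$, which has stationary points at $x = \pm 1$:</p>

```python
# Second-derivative test sketch: find where F'(x) = 0, then use the
# sign of F''(x) to classify each stationary point.
# F(x) = x^3 - 3x is a hypothetical example, not from the text.

def d1(f, x, h=1e-5):
    """Central-difference approximation of the first derivative f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    """Central-difference approximation of the second derivative f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

F = lambda x: x**3 - 3 * x   # F'(x) = 3x^2 - 3, zero at x = ±1

for x0 in (-1.0, 1.0):
    assert abs(d1(F, x0)) < 1e-6          # confirm it is a stationary point
    kind = "minimum" if d2(F, x0) > 0 else "maximum"
    print(x0, kind)
# -1.0 maximum  (F''(-1) = -6 < 0)
#  1.0 minimum  (F''(1)  = +6 > 0)
```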

<h2 id="partial-fractions">Partial fractions:</h2>
<ul>
  <li>When a function (fraction) is too complicated to differentiate directly, we can decompose the fraction into simpler partial fractions and work with those</li>
  <li>For example: 
<img src="/til-notes/assets/img/calculus/partial_fractions.png" alt="partial-fractions" /></li>
</ul>

<h2 id="partial-differentiation">Partial Differentiation:</h2>
<ul>
  <li>What do we do when a function contains more than one independent variable, like $F(u,v)$?</li>
  <li>We can now use partial differentiation to differentiate the function. Take $y = u \cdot v$ as an example.
    <ul>
      <li>We first consider one variable to be constant, say v! Then $dy_v=v \cdot du$</li>
      <li>Similarly, u is next considered constant, giving $dy_u=u \cdot dv$</li>
      <li>Since the differentiation has been performed only partially on the equation, the above equations are also called partial differentials. They can also be written as:</li>
      <li>$\frac{\partial{y}}{\partial{u}}=v$ and $\frac{\partial{y}}{\partial{v}}=u$</li>
      <li>Substituting these in above equns:</li>
      <li>$dy_v=\frac{\partial{y}}{\partial{u}} \cdot du$ ;and  $dy_u=\frac{\partial{y}}{\partial{v}} \cdot dv$</li>
      <li>Since y depends on both u and v, we add both terms above to achieve the total differential, which leads to:</li>
      <li>$dy= \frac{\partial{y}}{\partial{u}}\cdot du+ \frac{\partial{y}}{\partial{v}} \cdot dv$</li>
    </ul>
  </li>
</ul>
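<p>The total-differential formula can be checked numerically. A small Python sketch for the example $y = u \cdot v$, where $\frac{\partial{y}}{\partial{u}} = v$ and $\frac{\partial{y}}{\partial{v}} = u$ (the particular numbers are hypothetical):</p>

```python
# Check dy = (∂y/∂u)·du + (∂y/∂v)·dv for y = u*v:
# the exact change and the total differential should agree up to the
# negligible second-order term du*dv.

u, v = 2.0, 3.0
du, dv = 1e-4, -2e-4

y = lambda a, b: a * b
exact_change = y(u + du, v + dv) - y(u, v)
total_differential = v * du + u * dv   # (∂y/∂u)·du + (∂y/∂v)·dv

print(abs(exact_change - total_differential) < 1e-7)  # True
```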

<h2 id="integration">Integration:</h2>
<ul>
  <li>As already said, an integral is ‘the sum of’. When we say $\int dx$, that should make up the whole of ‘x’.</li>
  <li>Now we can ask the question: if we have the slope of a curve, can we recreate the curve?</li>
  <li>For example take a line we have the equation $y = ax+b$, the slope is ‘a’ and the y-intercept ‘b’.
    <ul>
      <li>This slope $\frac{dy}{dx} = a$ describes the small triangles that sit under the line; when we stack those triangles together, we can form the line. But the intercept is missing, and we have no knowledge of it: where do we place the y-intercept?</li>
      <li>To tackle this intercept, we add a constant term ‘C’ after integrating!
  <img src="/til-notes/assets/img/calculus/integration.png" alt="integration" /></li>
      <li>What happens if it is a curve? We then have the slope as a function of ‘x’, like $\frac{dy}{dx}= ax$: as the value of x increases, the slope increases.</li>
      <li>The problem is that when the triangles are not small, the curve comes out coarse. If the curve needs to be smoother, the triangles need to be as small as possible.
  <img src="/til-notes/assets/img/calculus/integration2.png" alt="integration2" /></li>
    </ul>
  </li>
  <li>Integration can also be used to find the area under a curve. The area under the curve can be divided into strips; each strip can be considered a rectangle, and all the strips can be summed (integrated) to find the area under the curve. This is where bounds come in.</li>
  <li>As a curve can extend without limit, we find the area of the curve bounded by limits: an upper and a lower bound. They can be represented as $\int_{x=x_1}^{x=x_2} y \cdot dx$</li>
</ul>]]></content><author><name>Jayadeep</name></author><category term="math" /><category term="documentation" /><category term="calculus basics" /><category term="Partial fractions" /><category term="partial differentiation" /><category term="integration" /><category term="January26" /><summary type="html"><![CDATA[It would be nice to revisit calculus while I’m revisiting the math background. So, today’s blog is related to calculus. Today’s topics include:]]></summary></entry><entry><title type="html">Matrices</title><link href="https://jayadeep19.github.io/til-notes/matrices" rel="alternate" type="text/html" title="Matrices" /><published>2026-01-16T00:00:00+00:00</published><updated>2026-01-16T00:00:00+00:00</updated><id>https://jayadeep19.github.io/til-notes/matrices</id><content type="html" xml:base="https://jayadeep19.github.io/til-notes/matrices"><![CDATA[<p>Today’s topics include:</p>
<ul id="markdown-toc">
  <li><a href="#matrices" id="markdown-toc-matrices">Matrices</a></li>
  <li><a href="#determinants" id="markdown-toc-determinants">Determinants:</a></li>
  <li><a href="#gaussian-elimination" id="markdown-toc-gaussian-elimination">Gaussian Elimination:</a></li>
  <li><a href="#rank" id="markdown-toc-rank">Rank:</a></li>
</ul>

<h2 id="matrices">Matrices</h2>
<ul>
  <li>A matrix is a 2D array of elements. The horizontal elements are collectively called a row; similarly, the vertical elements are collectively called a column. Using these we can describe a matrix by its dimension: <em>rows × columns</em>.</li>
  <li>The relation between a vector and a matrix can be put shortly as: a vector can be transformed using a matrix. For example, when a vector is multiplied by a scalar, the vector shrinks or magnifies. In the same way, when a vector is multiplied by a matrix, it undergoes a certain transformation.</li>
  <li>Example of a matrix with dimension 3*3:
    <ul>
      <li>
\[A = \begin{bmatrix}
  1&amp;2&amp;3 \\
  4&amp;5&amp;6 \\
  7&amp;8&amp;9 \\
  \end{bmatrix}\]
      </li>
    </ul>
  </li>
  <li>From the above matrix The column vectors becomes \(\begin{bmatrix} 1\\4\\7\\ \end{bmatrix}, \begin{bmatrix} 2\\5\\8\\ \end{bmatrix}, \begin{bmatrix} 3\\6\\9\\ \end{bmatrix}\).</li>
  <li>The row vectors for A are: \(\begin{bmatrix} 1\\2\\3\\ \end{bmatrix}\) and so on… Since the row elements are transposed to write them as column vectors, they are really the transposes of the rows rather than the rows themselves.</li>
  <li>Identity matrix (I):
    <ul>
      <li>A square matrix whose diagonal elements are 1 and all other elements are 0</li>
    </ul>
  </li>
  <li>Zero Matrix:
    <ul>
      <li>Every element in the matrix is zero</li>
    </ul>
  </li>
  <li>Transpose of a Matrix:
    <ul>
      <li>The transpose of a matrix $A$ is $A^T$, where the columns of the matrix A are the rows of the matrix $A^T$</li>
    </ul>
  </li>
  <li>Symmetric matrix:
    <ul>
      <li>Where $A = A^T$</li>
    </ul>
  </li>
  <li>Matrix Operations:
    <ul>
      <li>Matrices support operations like:
        <ol>
          <li>Multiplication by scalar</li>
          <li>Matrix addition</li>
          <li>Matrix-Matrix multiplication
            <ul>
              <li>When two matrices of sizes $m \times n$ and $j \times k$ are multiplied, the product is another matrix of size $m \times k$. The operation can only be carried out if n = j. $(m \times n)\times (j \times k) = (m \times k) \iff n = j $</li>
            </ul>
          </li>
        </ol>
      </li>
      <li>Matrix subtraction can be performed by combining scalar multiplication and addition, like $C = A+(-1)B$</li>
    </ul>
  </li>
  <li>The dot product can be performed using matrix multiplication:
    <ul>
      <li>
\[u \cdot v = u^T\cdot v = 
  (u_1, u_2 ...) \begin{pmatrix}v_1\\v_2\\.\\.\\.\\ \end{pmatrix} = \sum_{i = 1}^n u_iv_i\]
      </li>
    </ul>
  </li>
  <li><strong>Matrix Inversion</strong>: 
<a id="Matrix-inversion-anchor"></a>
    <ul>
      <li>When we have an equation to solve: Ax = b, How do we solve it?</li>
    </ul>

    <blockquote>
      <ul>
        <li>We can use matrix inversion:
          <ul>
            <li>Check if the matrix is invertible (only square matrices can have inverses)</li>
            <li>Calculate the inverse of matrix A</li>
            <li>Multiply ‘b’ by the inverse of A $(A^{-1})$</li>
            <li>Verify the solution</li>
          </ul>
        </li>
      </ul>
    </blockquote>

    <ul>
      <li>A square matrix A is said to be invertible when there exists a matrix $A^{-1}$ such that: $A \cdot A^{-1} = I$</li>
      <li>The inverse of a $2\times2$ matrix is given by:
  <img src="/til-notes/assets/img/algebra/matrix_inversion.png" alt="matrix-inversion" /></li>
      <li>The denominator is also called the <a href="#determinants">determinant</a> of the matrix.</li>
    </ul>
  </li>
  <li>The columns of a matrix A are said to be <strong>linearly independent</strong> if $AX=0$ has only the trivial solution $X=0$.
    <ul>
      <li>Equivalently, a matrix has linearly independent columns if no column can be expressed as a linear combination of the other columns. If there is any solution other than $X=0$, the columns of A are linearly dependent</li>
      <li>Basis theorem states that,
        <ul>
          <li>In a 2D plane two vectors, u,v form a basis if u,v are linearly independent</li>
          <li>In a 3D space, the three vector u,v,w must be linearly independent to be a basis of the 3d space.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>In 2D and 3D spaces, there are some matrices used for <strong>Rotation, Shearing and Scaling</strong> a vector. When multiplied by these vectors, we can perform the intended operation on the vector But what happens when there are two different coordinate systems?</li>
  <li>When there are two different basis (say two different coordinate systems with different bases), we can formulate a matrix such that we can represent any vector in one basis to another basis
    <ul>
      <li>$v = B\hat{v}$</li>
      <li>Where v is the vector in one basis and $\hat{v}$ is the vector in the other basis. The matrix B can be used to transform one coordinate system to another. This is a core concept used in coordinate transformations in robotics, image processing and many more.</li>
    </ul>
  </li>
  <li><strong>Orthogonal matrix</strong>:
    <ul>
      <li>A matrix Q is said to be orthogonal if its transpose is its inverse matrix.
        <ul>
          <li>
\[Q \cdot Q^T = I \iff Q^{-1}=Q^T\]
          </li>
        </ul>
      </li>
      <li>So, an orthogonal matrix is a square matrix with the column vectors perpendicular to each other and one unit long.</li>
      <li>An orthogonal matrix will rotate a vector in a 2D space but will not cause shear or stretch.</li>
    </ul>
  </li>
</ul>
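<p>The inversion steps above can be sketched for the $2\times2$ case in plain Python, using the closed-form inverse from the text. The numbers are hypothetical; a real application would use a linear-algebra library:</p>

```python
# Solving A x = b via the 2x2 inverse formula:
# A^{-1} = (1/det) [[d, -b], [-c, a]] for A = [[a, b], [c, d]].

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c          # the determinant (denominator)
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, x):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]

A = [[1, 2], [3, 8]]
b = [5, 18]
x = matvec(inverse_2x2(A), b)   # x = A^{-1} b
print(x)  # [2.0, 1.5]
```

<p>Multiplying back, $A x = b$, verifies the solution, matching the last step of the recipe above.</p>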

<h2 id="determinants">Determinants:</h2>
<ul>
  <li>The denominator in the <a href="#Matrix-inversion-anchor">inverse</a> of the matrix is called as determinant of the matrix.</li>
  <li>This is very useful for determining properties of a matrix, like whether the matrix is invertible (it is, exactly when $det(A) \ne 0$), whether a linear system has a unique solution, and so on…
    <ul>
      <li>$det(A) \ne 0$: the system has exactly one unique solution</li>
      <li>$det(A) = 0$: the system has no solution or infinitely many solutions</li>
    </ul>
  </li>
  <li>For a matrix A of dimension $(2 \times 2)$ the determinant is denoted det(A) or $|A|$ and is given by:
    <ul>
      <li>
\[det(A) = \begin{vmatrix} a_{11}&amp;a_{12} \\ a_{21}&amp;a_{22} \end{vmatrix} = a_{11}a_{22}-a_{12}a_{21}\]
      </li>
      <li>similarly we can get the determinant for higher dimension matrices just like the cross product</li>
    </ul>
  </li>
  <li>Some useful properties of the determinant:
    <ul>
      <li>$det(A) = det(A^T)$</li>
      <li>$det(AB) = det(A)det(B)$</li>
    </ul>
  </li>
  <li>Adjoint Matrix:
    <ul>
      <li>The adjoint matrix is useful for finding the inverse of small matrices ($2 \times 2$, $3 \times 3$) without the complicated formula seen before</li>
      <li>
\[A^{-1} = \frac{1}{det(A)} \cdot adj(A)\]
      </li>
    </ul>
  </li>
  <li>Cramer’s Rule:
    <ul>
      <li>Cramer’s rule gives an explicit formula for solving a system of equations, say a linear system Ax = y, without needing to find $A^{-1}$, given that $det(A) \ne 0$</li>
      <li>All the values of $x$ are then given by just dividing the determinant of a modified matrix ($A_i$) by the determinant of A</li>
      <li>$x_i = \frac{det(A_i)}{det(A)}$</li>
      <li>Where $A_i$ is obtained by replacing the $i^{th}$ column in <strong>A</strong> with the constants in y</li>
      <li>Therefore: $x_1 = \frac{det(A_1)}{det(A)}; x_2 = \frac{det(A_2)}{det(A)}$</li>
      <li><img src="/til-notes/assets/img/algebra/Adjoint_matrix.png" alt="adjoint-matrix" /></li>
    </ul>
  </li>
</ul>
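<p>Cramer’s rule as described above can be sketched for a $2\times2$ system in Python (the matrix and right-hand side are hypothetical examples):</p>

```python
# Cramer's rule for a 2x2 system A x = y: x_i = det(A_i) / det(A),
# where A_i is A with column i replaced by y.

def det2(A):
    """Determinant of a 2x2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def cramer_2x2(A, y):
    d = det2(A)
    assert d != 0, "Cramer's rule requires det(A) != 0"
    A1 = [[y[0], A[0][1]], [y[1], A[1][1]]]  # column 1 replaced by y
    A2 = [[A[0][0], y[0]], [A[1][0], y[1]]]  # column 2 replaced by y
    return [det2(A1) / d, det2(A2) / d]

A = [[2, 1], [1, 3]]
y = [5, 10]
print(cramer_2x2(A, y))  # [1.0, 3.0]
```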

<h2 id="gaussian-elimination">Gaussian Elimination:</h2>
<ul>
  <li>Gaussian Elimination is a method to solve the system of linear equations.</li>
  <li>As we will see in the following section, Gaussian elimination can also produce other fruitful outcomes, like finding the <a href="#rank">rank</a> and nullity of the matrix. These are useful properties describing how much independent information the matrix carries.</li>
  <li>Gaussian elimination does not change the rank, column space or null space of the matrix. It just strips the excess information from the matrix</li>
  <li>The basic idea of Gaussian elimination is to perform row operations to bring the matrix into something called Row Echelon Form (REF) or, further, Reduced Row Echelon Form (RREF).
    <ul>
      <li>Swapping the rows : We can swap any two rows</li>
      <li>Scaling the rows : A row can be multiplied by a non zero scalar</li>
      <li>Row Addition : A multiple of one row can be added to or subtracted from another to create the pivot elements</li>
    </ul>
  </li>
  <li>The REF form of the matrix has all the zero rows at the bottom and non zero rows at the top.</li>
  <li>We end the row operations when we have found the pivot element for each non-zero row, achieving REF.</li>
  <li>For example:
    <ul>
      <li>\(A = \begin{bmatrix} 1&amp;2 \\ 3&amp;8 \end{bmatrix}\) can be written into its REF \(\begin{bmatrix} \underline{1}&amp;2 \\ 0&amp;\underline{2} \end{bmatrix}\) by performing $\to R_2-(R_1 \times 3)$.</li>
      <li>The underlined elements are the pivot elements of that row. See how the row elements are 0 before the pivot elements?</li>
      <li>It can further be reduced into its RREF by $\to \frac{R_2}{2}$ so that all the pivot elements would be 1, to become \(\begin{bmatrix} \underline{1}&amp;2 \\ 0&amp;\underline{1} \end{bmatrix}\)</li>
    </ul>
  </li>
  <li>The result of Gaussian elimination could be either:
    <ul>
      <li>One solution for the system of equations (the lines in the system of equations intersect at 1 point)</li>
      <li>No solution (the lines could be parallel lines, so no solution)</li>
      <li>Infinitely many solutions (the intersection could be a line, a plane, …)</li>
    </ul>
  </li>
</ul>
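<p>The row operations above can be sketched in Python, reducing the same example matrix from the text to REF. A minimal sketch, without the pivot-normalization step that would give RREF:</p>

```python
# Minimal Gaussian-elimination sketch: reduce a matrix to row echelon
# form using the row operations described above (swap, scale, add).

def row_echelon(M):
    A = [row[:] for row in M]          # work on a copy
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        # find a usable pivot row for this column (row swap if needed)
        pivot = next((i for i in range(r, rows) if abs(A[i][c]) > 1e-12), None)
        if pivot is None:
            continue                   # no pivot in this column
        A[r], A[pivot] = A[pivot], A[r]
        # eliminate the entries below the pivot (row addition)
        for i in range(r + 1, rows):
            factor = A[i][c] / A[r][c]
            A[i] = [A[i][j] - factor * A[r][j] for j in range(cols)]
        r += 1
    return A

A = [[1, 2], [3, 8]]
print(row_echelon(A))  # [[1, 2], [0.0, 2.0]]
```

<p>The number of non-zero rows in the result is the rank, which connects to the next section.</p>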

<h2 id="rank">Rank:</h2>
<ul>
  <li>Rank is a property of a matrix that truly tells how much information is in the matrix.</li>
  <li>It is the number of linearly independent columns or rows in a matrix. Full rank is when the rank of an $n \times n$ matrix is <strong>n</strong>; if the rank is less than <strong>n</strong>, the matrix is called rank deficient. A full-rank matrix carries no redundant information; a rank-deficient matrix contains some redundant information.</li>
  <li>A matrix like \(\textbf{A} = \begin{bmatrix} 1&amp;2 \\ 2&amp;4 \end{bmatrix}\) has a rank of 1. When this matrix is applied to a 2D space, it will collapse everything to a 1D space. This property is useful in some cases where we need to reduce the dimensionality of a vector space.</li>
  <li>A rank of the matrix can be found out after transforming the matrix to its <a href="#gaussian-elimination">REF</a>.</li>
  <li>Nullity:
    <ul>
      <li>Also important is the concept of the nullity of the matrix: $Rank(A) + Nullity(A) = \text{number of columns } (n)$. As seen above, the number of dimensions that were collapsed during the transformation is the nullity of the matrix.</li>
    </ul>

    <blockquote>
      <ul>
        <li>Nullity is the dimension of the nullspace of the matrix.</li>
        <li>In data science this helps to identify features that are redundant and add no new information for the model during training.</li>
      </ul>
    </blockquote>

    <ul>
      <li><strong>Null space</strong>:
        <ul>
          <li>The Null space of A is the entire solution set for the $Ax = 0$.</li>
          <li>To find the nullity of a matrix, we can use the RREF: in this form, the number of columns with pivot elements is the rank of the matrix and the number of columns with no pivot element is the nullity of the matrix. This should make sense with the above equation.</li>
          <li><img src="/til-notes/assets/img/algebra/Nullity.PNG" alt="nullity" /></li>
          <li>For example for the below REF form of a matrix, there are 2 columns that are either zero or without pivot element. So the rank is 3 and nullity is 2.</li>
          <li><img src="/til-notes/assets/img/algebra/nullspace.png" alt="nullspace" /></li>
          <li>Therefore the above solution is the nullspace of the matrix A.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Row Space and Column Space:
    <ul>
      <li>The set of all linear combinations of the linearly independent columns or rows of the matrix are called the column space or rowspace of the matrix.</li>
      <li>The number of linearly independent rows or columns is also called the row rank (rowrank(A)) or column rank (colrank(A)) and can be obtained from the REF. It can be observed that the row rank and column rank are both equal to the rank of the matrix.</li>
      <li>The columns and rows of the pivot element form the basis of the column space and row space respectively.</li>
      <li>For a product of two matrices $A=BC$: $rank(A) \le rank(B)$ and $rank(A) \le rank(C)$.</li>
      <li>These concepts are also used by Singular Value Decomposition(SVD) used in AI and other technologies.</li>
    </ul>
  </li>
</ul>]]></content><author><name>Jayadeep</name></author><category term="math" /><category term="documentation" /><category term="matrices" /><category term="gaussian elimination" /><category term="determinants" /><category term="rank" /><category term="January26" /><summary type="html"><![CDATA[Today’s topics include:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://jayadeep19.github.io/til-notes/algebra/matrix.png" /><media:content medium="image" url="https://jayadeep19.github.io/til-notes/algebra/matrix.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Linear algebra basics</title><link href="https://jayadeep19.github.io/til-notes/linear-algebra" rel="alternate" type="text/html" title="Linear algebra basics" /><published>2026-01-15T00:00:00+00:00</published><updated>2026-01-15T00:00:00+00:00</updated><id>https://jayadeep19.github.io/til-notes/linear-algebra</id><content type="html" xml:base="https://jayadeep19.github.io/til-notes/linear-algebra"><![CDATA[<p>In the past couple of weeks, I decided to give my math skills a visit, namely Linear Algebra and calculus. These branches of math are crucial for several fields of engineering such as computer vision, image processing, ML and AI, to name some. Although I will be diving deep into statistics from tomorrow (I got a very nice book called “I’ll tell you in the next blogs”), I would like to write a few blogs about the basic and important concepts that I studied (again!) in the past 10-12 days, starting with the good old Linear Algebra.</p>

<p>So, Linear Algebra, simply put, deals with geometric shapes: how we represent them mathematically and find their solutions. For example, when there are two lines, we can investigate where they are intersecting, and how we find out whether they are really intersecting or are just parallel lines. One might have heard of terms like vectors, scalars and matrices; these can be considered the core of Linear Algebra. So, why is this important for ML or AI? Well, matrices provide a nice way to store the huge data for training the models. Moreover, neural networks use matrix multiplication to improve their accuracy.</p>

<p>In this post:</p>
<ul id="markdown-toc">
  <li><a href="#vectors" id="markdown-toc-vectors">Vectors</a></li>
  <li><a href="#dot-product-scaler-product" id="markdown-toc-dot-product-scaler-product">Dot Product (Scaler Product)</a></li>
  <li><a href="#vector-product-cross-product" id="markdown-toc-vector-product-cross-product">Vector Product (Cross Product)</a></li>
</ul>

<h2 id="vectors">Vectors</h2>
<ol>
  <li>
    <p>Vectors: <br />
A vector is an imaginary line that has length and direction. We can also think of a vector as a line directed from one point in space to another. For example, when a ball is thrown in a straight line, we can say that the velocity of the ball is a vector with a direction and a certain magnitude. It is represented as $\vec{v}$</p>

    <ul>
      <li>The length of the vector is then given by $\lvert \lvert v \rvert\rvert$. It is a scalar value</li>
      <li>A vector with zero length is a Zero vector, so it becomes a <em>point</em>.</li>
    </ul>
  </li>
  <li>
    <p>Vector operations: <br />
Two vectors can be added, subtracted or multiplied.</p>

    <ul>
      <li>
        <p>Multiplication: A vector can be multiplied by a scalar to scale the vector; this is called vector scaling. <br />
 Two vectors can also be multiplied which we come across later.
 \(k*\vec{v} = k\vec{v}\)</p>
      </li>
      <li>
        <p>Addition and subtraction: Two vectors <strong>u and v</strong> can be added to form a resulting vector <strong>u+v</strong>. However, subtraction of two vectors can be performed by multiplying one vector by ‘-1’ and adding: 
 \(\vec{u}+(-1)*\vec{v} = \vec{u}-\vec{v}\)</p>
      </li>
    </ul>
  </li>
  <li>Vector Bases and coordinates: <br />
Now that we know what a vector is, how do we represent this vector in space? For this we take the help of coordinates. We define the coordinates as follows
    <ul>
      <li>In 1D: <br />
 When we have two vectors of different sizes, we can always represent the bigger vector as a scaled version of the smaller vector. We use this same principle to represent $\vec{v}$ in 1D space. We consider <strong>‘e’</strong> as a known length and represent <strong>v</strong> as $v = xe$. Meaning, ‘e’ scaled by ‘x’ in 1D gives us the vector ‘v’.</li>
      <li>In 2D and 3D: <br />
 In a similar way to the 1D space, we now have two axes or <strong>basis vectors (e)</strong>, and the vector ‘v’ can be represented as a linear combination of the scaled basis vectors. 
 \(v = x_1*e_1+x_2*e_2 \\
 v = x_1*e_1+x_2*e_2+x_3*e_3\)</li>
      <li>The scalars $(x_1,x_2,x_3)$ are nothing but the coordinates of the vector v along the axis xyz respectively. They can be represented as $(v_x, v_y, v_z)$ respectively.</li>
      <li>These vectors can be written in another form called vector notation:
        <ul>
          <li>1D: \(\textbf{v} = \begin{bmatrix} v_x \end{bmatrix}\)</li>
          <li>2D: \(\textbf{v} = \begin{bmatrix} v_x \\\ v_y \end{bmatrix}\)</li>
          <li>3D: \(\textbf{v} = \begin{bmatrix} v_x \\\ v_y \\\ v_z\end{bmatrix}\)</li>
        </ul>
      </li>
      <li>The next set of terms in our equation are $(e_1, e_2, e_3)$. These are, simply put, the vectors representing the individual axes of the coordinate space.
        <ul>
          <li>say we have a 3D space: the $e_1$ also called as 1st basis vector for the coordinate space would become $e_1 = \begin{bmatrix} 1\\0\\0 \end{bmatrix}$</li>
          <li>similarly, $e_2 = \begin{bmatrix} 0\\1\\0 \end{bmatrix}$ and $e_3 = \begin{bmatrix} 0\\0\\1 \end{bmatrix}$</li>
          <li>These are also called as <strong>standard basis</strong></li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Vector spaces:
What happens when we need to represent a vector in more than 3D (as discussed above)? We use something called a Vector Space. Vector space is a very common term in linear algebra. We can think of it as a space defined by objects like arrows (vectors), where one can add objects together or scale them, and both these operations follow well-defined rules. They are also called linear spaces
    <ul>
      <li>A vector space in an n-D is represented as $R^n$ and a vector in this space can be represented as 
\(\vec{u} = \sum_{i = 1}^n u_i*e_i\)</li>
      <li>where $u_i$ are the coordinates of $\vec{u}$ and $e_i$ are the basis vectors for the $R^n$ vector space.</li>
      <li>Now we can summarise the basis vectors for $R^n$ (the <strong>canonical basis of the vector space</strong>) as:
\(e_1 = (1,0...0), \\
e_2 = (0,1...0), \\
. \\
. \\
e_n = (0,0...1),\)</li>
      <li>The set of scalars (Real or complex numbers) used to scale the vectors in this vector space are called as <strong>Field</strong> for the vector space. The properties(eg: dimensions) of the vector space depends on the Field</li>
      <li>An example of vector addition:</li>
      <li><img src="/til-notes/assets/img/algebra/vecaddition.PNG" alt="vecaddition" /></li>
    </ul>
  </li>
</ol>
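<p>The coordinate-wise operations above can be sketched in Python: vectors in $R^n$ as lists of coordinates, with addition, scaling, and subtraction as $\vec{u}+(-1)\cdot\vec{v}$. The particular vectors are hypothetical examples:</p>

```python
# Component-wise vector operations in R^n, as described above.

def add(u, v):
    """Vector addition: coordinate-wise sum."""
    return [a + b for a, b in zip(u, v)]

def scale(k, v):
    """Vector scaling: multiply every coordinate by the scalar k."""
    return [k * a for a in v]

u = [1, 2, 3]
v = [4, 0, -1]
print(add(u, v))             # [5, 2, 2]
print(scale(2, v))           # [8, 0, -2]
print(add(u, scale(-1, v)))  # u - v = [-3, 2, 4]
```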

<h2 id="dot-product-scaler-product">Dot Product (Scaler Product)</h2>
<ul>
  <li>Dot product is very important concept in linear algebra, it helps determining the length and angle between the vectors.</li>
</ul>
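<p>Computing a dot product and recovering the angle between two vectors from $u \cdot v = \|u\| \|v\| \cos[u, v]$ can be sketched in a few lines of Python (the example vectors are hypothetical):</p>

```python
# Dot product, vector length, and the angle between two vectors.
import math

def dot(u, v):
    """Coordinate-wise dot product."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Length of a vector: sqrt(u . u)."""
    return math.sqrt(dot(u, u))

u = [3, 0]
v = [1, 1]
# cos of the angle from the dot-product formula, then invert it
angle = math.acos(dot(u, v) / (norm(u) * norm(v)))
print(dot(u, v))                   # 3  (positive, so angle < pi/2)
print(round(math.degrees(angle)))  # 45
```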

<ol>
  <li>dot product: <br />
The dot product between two vectors is denoted by $\vec{u} \cdot \vec{v}$. The product is always a <em>scalar value</em> (which is why it is called the scalar product).
    <ul>
      <li>
\[u \cdot v = 
\begin{cases} 
\|u\| \|v\| \cos[u, v], &amp; \text{if } u \neq 0 \text{ and } v \neq 0, \\ 
0, &amp; \text{if } u = 0 \text{ or } v = 0. 
\end{cases}\]
      </li>
      <li>$||\mathbf{u}||$ is the length of the vector <strong>u</strong>, and it is given, for example in 3D, by: $||\mathbf{u}|| = \sqrt{u_x^2 + u_y^2 + u_z^2}$</li>
      <li>$[u,v]$ is the angle between the two vectors, and this has a specific implication for the outcome of the dot product.</li>
    </ul>
    <ul>
      <li>The dot product $u \cdot v$ is +ve $\iff 0&lt;[u,v]&lt;\pi/2$</li>
      <li>The dot product $u \cdot v$ is -ve $\iff \pi/2&lt;[u,v]&lt;\pi$</li>
      <li>The dot product $u \cdot v$ is <strong>0</strong> $\iff [u,v]=\pi/2$ or $u = 0$ or $v = 0$ <strong>(when the dot product of two non-zero vectors is 0, the vectors are orthogonal)</strong></li>
    </ul>
  </li>
  <li>Normalization: <br />
A unit vector can be produced from any non-zero vector. This process is called normalization, and the resulting vector is called a normalized vector.
    <ul>
      <li>$n = \frac{\mathbf{u}}{||u||}$: simply divide the vector by its length.</li>
      <li>This is a useful concept when we want to preserve only the direction of a vector.</li>
    </ul>
  </li>
  <li>
    <p>One more area where the dot product is very useful is projecting a vector u onto another vector v, also called the <strong>orthogonal projection</strong>: $\text{proj}_v(u) = \frac{u \cdot v}{v \cdot v} v$.
 <img src="/til-notes/assets/img/algebra/projection.PNG" alt="orthonormal" /></p>
  </li>
  <li>Dot product in <strong>ORTHONORMAL BASIS</strong>:
    <ul>
      <li>The dot product for an <strong>orthonormal basis</strong> is such that 
\(e_i \cdot e_j = 
\begin{cases}
0, &amp; \text{if } i \ne j, \\
1, &amp; \text{if } i = j.
\end{cases}\)</li>
      <li>The vectors in the set must be pairwise orthogonal (perpendicular) to one another, and each vector must be a <strong>unit vector</strong> ($||v|| = 1$).</li>
      <li>From the above, we can say that the standard basis is an orthonormal basis</li>
      <li>This leads to a simple formula for the dot product between two vectors</li>
    </ul>
    <ul>
      <li>Say $\vec{u} = u_1e_1+ u_2e_2 + u_3e_3$ and $\vec{v}= v_1e_1+ v_2e_2 + v_3e_3$</li>
      <li>Then $u \cdot v$ is given by:</li>
      <li><img src="/til-notes/assets/img/algebra/dotpro_orthonormal.PNG" alt="orthonormal" /></li>
      <li>The orthonormal basis is very useful in coordinate systems. The orthogonality and unit vectors make calculations easier compared to a regular basis coordinate system.</li>
      <li>When an orthonormal basis is stacked into a matrix we get an orthogonal matrix, whose inverse equals its transpose; for the standard basis this matrix is the identity matrix (I).</li>
    </ul>
  </li>
</ol>
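<p>The formulas in this section (dot product, length, angle, normalization and projection) can be sketched in plain Python; the helper names <code>dot</code> and <code>norm</code> are my own, not from the notes:</p>

```python
import math

def dot(u, v):
    """Dot product in an orthonormal basis: u . v = sum_i u_i * v_i."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Length of u: ||u|| = sqrt(u . u)."""
    return math.sqrt(dot(u, u))

u = (3.0, 4.0, 0.0)
v = (1.0, 0.0, 0.0)

# Angle [u, v] recovered from u . v = ||u|| ||v|| cos[u, v]
angle = math.acos(dot(u, v) / (norm(u) * norm(v)))

# Normalization: divide the vector by its length to get a unit vector
n = tuple(c / norm(u) for c in u)

# Orthogonal projection of u onto v: (u . v / v . v) * v
proj = tuple(dot(u, v) / dot(v, v) * c for c in v)

print(dot(u, v), norm(u))  # 3.0 5.0
print(n)                   # (0.6, 0.8, 0.0)
print(proj)                # (3.0, 0.0, 0.0)
```

<p>Note that the coordinate-wise dot-product formula used here is valid only in an orthonormal basis, as discussed above.</p>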

<h2 id="vector-product-cross-product">Vector Product (Cross Product)</h2>
<ul>
  <li>A vector product’s output is another vector. This new vector is perpendicular to both of the input vectors.</li>
  <li>The vector product between two vectors (u, v), written $\vec{u} \times \vec{v}$, is defined by three properties:
    <ol>
      <li>$\vec{u} \times \vec{v}$ is orthogonal to both $\vec{u}$ and $\vec{v}$,</li>
      <li>$||\vec{u}\times\vec{v}|| = ||\vec{u}|| \cdot ||\vec{v}|| \cdot \sin[u,v]$ (<strong>Magnitude</strong>)</li>
      <li>The three vectors $\vec{u}$, $\vec{v}$ and $\vec{u} \times \vec{v}$ are positively oriented (right-handed)</li>
    </ol>
  </li>
  <li>Orientation:
    <ul>
      <li>The term orientation describes how the vectors u, v and their product are oriented relative to each other. The most common way of determining the orientation is the right-hand rule.</li>
      <li>With the right hand, point the thumb along u and the index finger along v; the middle finger then gives the direction of the resulting vector</li>
      <li>In 3D space, if the two vectors u, v span a 2D plane, then:
        <ul>
          <li>$u \times v = w$: the vector w points upward out of the plane</li>
          <li>$v \times u = -w$: the vector points downward</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Magnitude:
    <ul>
      <li>The magnitude of the resulting vector depends on the sine of the angle between the two vectors.</li>
      <li>From this we can say that when the angle [u,v] becomes 0 (the vectors are parallel), $u \times v$ becomes the zero vector.</li>
    </ul>
  </li>
  <li>For calculating the cross product between u, v in an orthonormal basis:
    <ul>
      <li>We can use the determinant form (Sarrus’ rule):</li>
      <li>\(\vec{u} = \begin{bmatrix} u_1 \\\ u_2 \\\ u_3\end{bmatrix}\), \(\vec{v} = \begin{bmatrix} v_1 \\\ v_2 \\\ v_3\end{bmatrix}\)</li>
      <li>\(\vec{u} \times \vec{v} = 
  \begin{vmatrix}
  \vec{i} &amp; \vec{j} &amp; \vec{k} \\
  u_1 &amp; u_2 &amp; u_3 \\
  v_1 &amp; v_2 &amp; v_3
  \end{vmatrix}\) (determinant form)</li>
      <li>
\[\mathbf{u} \times \mathbf{v} = 
  \mathbf{i} \begin{vmatrix} u_2 &amp; u_3 \\ v_2 &amp; v_3 \end{vmatrix} -
  \mathbf{j} \begin{vmatrix} u_1 &amp; u_3 \\ v_1 &amp; v_3 \end{vmatrix} +
  \mathbf{k} \begin{vmatrix} u_1 &amp; u_2 \\ v_1 &amp; v_2 \end{vmatrix}\]
      </li>
    </ul>
  </li>
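  <li>The determinant expansion above can be sketched in plain Python (an illustration of my own; the function name <code>cross</code> is not from the notes):

```python
def cross(u, v):
    """Cross product in R^3 via the cofactor expansion of the determinant form."""
    return (u[1] * v[2] - u[2] * v[1],   # i component
            u[2] * v[0] - u[0] * v[2],   # j component
            u[0] * v[1] - u[1] * v[0])   # k component

u = (1.0, 0.0, 0.0)
v = (0.0, 1.0, 0.0)

w = cross(u, v)
print(w)            # (0.0, 0.0, 1.0) -- orthogonal to both u and v
print(cross(v, u))  # (0.0, 0.0, -1.0) -- swapping the operands flips the orientation
```
  </li>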
</ul>]]></content><author><name>Jayadeep</name></author><category term="math" /><category term="documentation" /><category term="linear algebra" /><category term="January26" /><summary type="html"><![CDATA[In the past couple of weeks, I decided to give my math skills a visit, namely Linear Algebra and Calculus. These branches of math are crucial for several fields of engineering such as Computer Vision, Image Processing, ML and AI, to name some. Although I will be diving deep into statistics from tomorrow (I got a very nice book called “I’ll tell you in the next blogs”), I would like to write a few blogs about the basic and important concepts that I studied (again!!) in the past 10-12 days. Starting with the good old Linear Algebra.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://jayadeep19.github.io/til-notes/algebra/vectorbasis.PNG" /><media:content medium="image" url="https://jayadeep19.github.io/til-notes/algebra/vectorbasis.PNG" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">My first blog</title><link href="https://jayadeep19.github.io/til-notes/about-the-author" rel="alternate" type="text/html" title="My first blog" /><published>2026-01-10T00:00:00+00:00</published><updated>2026-01-10T00:00:00+00:00</updated><id>https://jayadeep19.github.io/til-notes/about-the-author</id><content type="html" xml:base="https://jayadeep19.github.io/til-notes/about-the-author"><![CDATA[<p>Hi there! Welcome to my first blog. As I am currently applying for a full-time position and working part-time, I find little time to learn and document everything. I expect to maintain consistency in these blogs in the coming days and months. I have intermediate knowledge about Machine Learning, Deep Learning, SQL, Python, PowerBI and other tools. 
So, whoever reads these blogs is expected to have a basic understanding of these concepts.</p>]]></content><author><name>Jayadeep</name></author><category term="journal" /><category term="documentation" /><category term="sample" /><category term="January26" /><summary type="html"><![CDATA[Hi there! Welcome to my first blog. As I am currently applying for a full-time position and working part-time, I find little time to learn and document everything. I expect to maintain consistency in these blogs in the coming days and months. I have intermediate knowledge about Machine Learning, Deep Learning, SQL, Python, PowerBI and other tools. So, whoever reads these blogs is expected to have a basic understanding of these concepts.]]></summary></entry></feed>