FLD - Fisher Linear Discriminant

Let us assume we have sets $D_i$ representing $c$ classes, each containing $n_i$ elements ($\vert D_i\vert=n_i$). We are trying to reduce the dimensionality of the input data vectors $\mathbf{x} \in D_i$ from $d$, $d=dim(\mathbf{x})$, to $c-1$, where we believe the classification will be more straightforward, perhaps even possible with a linear classifier. We want to perform the following transformation:


\begin{displaymath}
T:\Re^{d} \rightarrow \Re^{c-1}
\end{displaymath} (3.4)

Fisher's Linear Discriminant tries to maximize the ratio of the scatter between classes to the scatter within classes.

Two-Class Classification

First let us illustrate the problem for two classes. We can see in figure 3.14 on page [*] that the projection onto vector $w_1$ gives us a chance to find a threshold suitable for deciding whether a test sample belongs to class $D_1$ or class $D_2$. This is possible because the projections of $D_1$ and $D_2$ onto $w_1$ form two separate sets. In figure 3.15 on page [*], on the other hand, this is not possible, because the projected points of $D_1$ overlap with the projected points of $D_2$. We can see that some projections onto a 1D space are better than others.

Figure 3.14: Projection from 2D to 1D -- we should be able to find some threshold for classification
\includegraphics[width=80mm,height=60mm]{lda-intro-twoclasses.eps}
Figure 3.15: Projection from 2D to 1D -- not a very good projection for classification purposes
\includegraphics[width=80mm,height=60mm]{lda-intro-twoclasses2.eps}

Now let us look at this in more detail. Assume we are projecting a sample $\mathbf{x}$ onto a vector $\mathbf{w}$, obtaining a scalar value $y$ that measures the position of $\mathbf{x}$ along $\mathbf{w}$. This is done using the following scalar product:


\begin{displaymath}
y=\mathbf{w}^T \mathbf{x}
\end{displaymath} (3.5)

It is straightforward that if $\mathbf{m_{D_1}}$ is the mean vector for class $D_1$, then its projection on vector $\mathbf{w}$ is:


\begin{displaymath}\tilde{m}_{D_1}=\mathbf{w}^T \mathbf{m_{D_1}}\end{displaymath}

The same rule holds for the scatter: if $\mathbf{S_{D_1}}$ is the scatter matrix of class $D_1$ (defined below), then the scatter of the samples projected onto $\mathbf{w}$ is:


\begin{displaymath}\tilde{s}^2_{D_1}=\mathbf{w}^T \mathbf{S_{D_1}} \mathbf{w}\end{displaymath}
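As a quick sanity check, here is a minimal numpy sketch (with made-up sample values, not from the text) that computes the projected mean and the projected scatter both ways and compares them:

\begin{verbatim}
import numpy as np

# Toy class D1: five 2-D samples (illustrative values only)
D1 = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 2.5], [2.2, 2.1], [1.8, 1.9]])
w = np.array([1.0, 0.5])                 # some projection direction

m_D1 = D1.mean(axis=0)                   # class mean vector
y = D1 @ w                               # projections y = w^T x of all samples

# Projected mean equals the mean of the projections
print(np.allclose(w @ m_D1, y.mean()))   # True

# Projected scatter equals w^T S_D1 w (with S_D1 as defined below)
S_D1 = (D1 - m_D1).T @ (D1 - m_D1)
print(np.allclose(w @ S_D1 @ w, ((y - y.mean()) ** 2).sum()))  # True
\end{verbatim}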

What we want to do is find a direction $\mathbf{w}$ along which the projected class means are as far apart as possible while the projected scatter of each class stays as small as possible.

We can see in figure 3.16 on page [*] that $\mathbf{w_1}$ does this for us:

Figure 3.16: The best choice separates the classes -- the distance between the projected means $m_1$ and $m_2$ should be maximized while the projected scatters $s_1$ and $s_2$ should be minimized.
\includegraphics[width=80mm,height=60mm]{lda-intro-twoclasses-mu.eps}

We can define the criterion for vector $\mathbf{w}$:


\begin{displaymath}
J(\mathbf{w})={ \vert\tilde{m}_{D_1} - \tilde{m}_{D_2}\vert^2 \over \tilde{s}^2_{D_1} + \tilde{s}^2_{D_2} }
\end{displaymath} (3.6)

We are looking for a $\mathbf{w}$ that maximizes the function $J(\mathbf{w})$.
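As an illustration of (3.6), here is a minimal sketch with made-up sample values and two arbitrary candidate directions; a direction along which the projected classes separate well yields a noticeably larger $J(\mathbf{w})$:

\begin{verbatim}
import numpy as np

# Two toy 2-D classes (illustrative values only)
D1 = np.array([[1.0, 2.0], [2.0, 2.5], [1.5, 3.0], [2.5, 2.0]])
D2 = np.array([[5.0, 0.5], [6.0, 1.0], [5.5, 0.0], [6.5, 0.8]])

def J(w, D1, D2):
    """Fisher criterion (3.6) for a single projection direction w."""
    y1, y2 = D1 @ w, D2 @ w
    m1, m2 = y1.mean(), y2.mean()
    s1 = ((y1 - m1) ** 2).sum()
    s2 = ((y2 - m2) ** 2).sum()
    return (m1 - m2) ** 2 / (s1 + s2)

print(J(np.array([1.0, 0.0]), D1, D2))   # x-axis: larger J for this data
print(J(np.array([0.0, 1.0]), D1, D2))   # y-axis: smaller J
\end{verbatim}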

Now we can define the scatter matrices. First, the within-class scatter matrix:


\begin{displaymath}
\mathbf{S_W}=\sum_{i=1}^c{\mathbf{S_{D_i}}}
\end{displaymath} (3.7)

where $\mathbf{S_{D_i}}$ is defined as:


\begin{displaymath}\mathbf{S_{D_i}}=\sum_{\mathbf{x} \in D_i}{ (\mathbf{x} - \mathbf{m_{D_i}})(\mathbf{x} - \mathbf{m_{D_i}})^T }\end{displaymath}

We can easily see that:


\begin{displaymath}\tilde{s}^2_{D_i}=\mathbf{w}^T\mathbf{S_{D_i}}\mathbf{w}\end{displaymath}

We define the within-class scatter as the sum of all class scatters. We are trying to find a projection in which these scatters are as small as possible (see figure 3.16 on page [*]). The projected within-class scatter is $\mathbf{w}^T\mathbf{S_W}\mathbf{w}$.
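A minimal numpy sketch of (3.7), assuming each class is stored as an $n_i \times d$ array of row vectors:

\begin{verbatim}
import numpy as np

def class_scatter(D):
    """S_Di = sum over x in Di of (x - m_Di)(x - m_Di)^T."""
    m = D.mean(axis=0)
    centered = D - m
    return centered.T @ centered

def within_class_scatter(classes):
    """S_W: the sum of the per-class scatter matrices, equation (3.7)."""
    d = classes[0].shape[1]
    S_W = np.zeros((d, d))
    for D in classes:
        S_W += class_scatter(D)
    return S_W
\end{verbatim}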

We define the between-class scatter for two classes (we will change this definition later in Multi-Class Classification) as:


\begin{displaymath}\mathbf{S_B}=(\mathbf{m_{D_1}} - \mathbf{m_{D_2}})(\mathbf{m_{D_1}} - \mathbf{m_{D_2}})^T\end{displaymath}

This captures the distance between the means of class $D_1$ and class $D_2$. We are trying to find a projection in which the projected means are as far apart as possible (see figure 3.16 on page [*]). The projected between-class scatter is $\mathbf{w}^T\mathbf{S_B}\mathbf{w}$.
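The two-class between-class scatter can be written in the same style (a sketch; the function name is ours):

\begin{verbatim}
import numpy as np

def between_class_scatter_two(D1, D2):
    """S_B = (m_D1 - m_D2)(m_D1 - m_D2)^T for the two-class case."""
    diff = D1.mean(axis=0) - D2.mean(axis=0)
    return np.outer(diff, diff)              # rank-1, d x d matrix
\end{verbatim}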

Maximizing the distance between the means while minimizing the class scatters is achieved by maximizing $J(\mathbf{w})$, which we can now rewrite as:


\begin{displaymath}
J(\mathbf{w}) = { \mathbf{w}^T\mathbf{S_B}\mathbf{w}\over \mathbf{w}^T\mathbf{S_W}\mathbf{w}}
\end{displaymath} (3.8)

In fact equations (3.8) and (3.6) are the same.
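For the two-class case the maximizer of (3.8) has a well-known closed form, $\mathbf{w} \propto \mathbf{S^{-1}_W}(\mathbf{m_{D_1}} - \mathbf{m_{D_2}})$; this classical textbook result is equivalent to the eigenvalue formulation solved later. A minimal numpy sketch:

\begin{verbatim}
import numpy as np

def fisher_direction(D1, D2):
    """Two-class Fisher direction w proportional to S_W^{-1} (m_D1 - m_D2)."""
    m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
    S_W = (D1 - m1).T @ (D1 - m1) + (D2 - m2).T @ (D2 - m2)
    w = np.linalg.solve(S_W, m1 - m2)    # solve instead of forming S_W^{-1}
    return w / np.linalg.norm(w)         # the scale of w does not affect J(w)
\end{verbatim}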

Multi-Class Classification

As stated in (3.4), FLD performs the dimensionality reduction for us. In multi-class classification we want to project into the space $\Re ^{c-1}$, where $c$ is the number of classes: $D_1, D_2, \dots, D_c$.

We will leave the definition of $\mathbf{S_W}$ unchanged, but we have to change the definition of $\mathbf{S_B}$. We define the total mean as:


\begin{displaymath}
\mathbf{m} = {1 \over N} \sum_{i=1}^c{n_i\mathbf{m_{D_i}}}
\end{displaymath} (3.9)

which is in fact the center of gravity of all class means $\mathbf{m_{D_i}}$; and if the means are computed using the proper statistic3.1, $\mathbf{m}$ is also the center of gravity of all $\mathbf{x} \in D_1 \cup D_2 \cup \dots \cup D_c$.

Then we can define scatter between classes as:


\begin{displaymath}
\mathbf{S_B}=\sum_{i=1}^c{ n_i(\mathbf{m_{D_i}} - \mathbf{m})(\mathbf{m_{D_i}} - \mathbf{m})^T}
\end{displaymath} (3.10)
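A minimal numpy sketch of (3.9) and (3.10), again assuming each class is stored as an $n_i \times d$ array:

\begin{verbatim}
import numpy as np

def total_mean(classes):
    """m = (1/N) * sum_i n_i m_Di -- weighted mean of the class means (3.9)."""
    N = sum(len(D) for D in classes)
    return sum(len(D) * D.mean(axis=0) for D in classes) / N

def between_class_scatter(classes):
    """S_B = sum_i n_i (m_Di - m)(m_Di - m)^T -- equation (3.10)."""
    m = total_mean(classes)
    d = classes[0].shape[1]
    S_B = np.zeros((d, d))
    for D in classes:
        diff = D.mean(axis=0) - m
        S_B += len(D) * np.outer(diff, diff)
    return S_B
\end{verbatim}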

Now what we are going to do is:


\begin{displaymath}
\mathbf{y} = \mathbf{W^Tx}
\end{displaymath} (3.11)

where $\mathbf{y}$ is the projection of $\mathbf{x}$ into the space $\Re ^{c-1}$, $dim(\mathbf{y})=c-1$, and $\mathbf{W}$ is a $d \times (c-1)$ matrix whose columns are the projection directions.

We are trying to maximize $J(\mathbf{W})$, defined very similarly to (3.8); the determinants turn the projected scatter matrices into scalar measures of scatter:


\begin{displaymath}
J(\mathbf{W}) = { \vert\mathbf{W}^T\mathbf{S_B}\mathbf{W}\vert \over \vert\mathbf{W}^T\mathbf{S_W}\mathbf{W}\vert }
\end{displaymath} (3.12)

As we can see in figure 3.17 on page [*], the LDA transformation is trying to spread the classes across $\Re ^{c-1}$.

Figure 3.17: The LDA transformation is trying to maximize $d_1^2+ \dots +d_c^2$ in the space $\Re ^{c-1}$
\includegraphics[width=80mm,height=60mm]{lda-multi.eps}

Solving Fisher's Discriminant

Let us denote by $\mathbf{w_i}$ the columns of $\mathbf{W}$ in (3.12). Then we need to solve the generalized eigenvalue problem:


\begin{displaymath}
(\mathbf{S_B}- \lambda\mathbf{S_W})\mathbf{w_i} = 0
\end{displaymath} (3.13)

This is the same as (with $\mathbf{E}$ denoting the identity matrix):


\begin{displaymath}
(\mathbf{S^{-1}_W}\mathbf{S_B}- \lambda\mathbf{E})\mathbf{w_i} = 0
\end{displaymath} (3.14)

We can solve this using Singular Value Decomposition, which is implemented in many mathematical programs and libraries.

Once we have the projection matrix $\mathbf{W}$, we can do the classification in a space of dimension $\le c-1$ by projecting the test sample into this space and finding the maximum of $p(\mathbf{x}\vert D_i)P(D_i)$. Another approach is to use the Nearest Neighbor rule in this reduced space.
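A minimal sketch of the second option (1-Nearest-Neighbor in the reduced space), assuming $\mathbf{W}$ has already been computed and the training samples are stored as rows with matching labels:

\begin{verbatim}
import numpy as np

def classify_nearest_neighbor(x, W, train_samples, train_labels):
    """Project the test sample with W and return the label of the closest
    projected training sample (1-NN in the <= c-1 dimensional space)."""
    y = W.T @ x                          # (c-1)-dimensional projection of x
    Y_train = train_samples @ W          # project all training samples
    distances = np.linalg.norm(Y_train - y, axis=1)
    return train_labels[np.argmin(distances)]
\end{verbatim}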

Implementation

We can use the svd(inv(Sw)*Sb) call implemented in Matlab, PDL::MatrixOps, or ccmath. In Matlab we can also use the eig(Sw,Sb) function, which solves the generalized eigenvalue problem directly.
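For readers not using Matlab, a roughly equivalent sketch in Python is given below (assuming numpy and scipy are available; scipy.linalg.eig(S_B, S_W) acts as the generalized eigensolver for (3.13)):

\begin{verbatim}
import numpy as np
from scipy.linalg import eig

def fisher_projection(S_W, S_B, c):
    """Return W whose columns are the generalized eigenvectors of (3.13),
    ordered by decreasing eigenvalue; only c-1 directions are kept
    because rank(S_B) <= c - 1."""
    eigvals, eigvecs = eig(S_B, S_W)         # solves S_B w = lambda S_W w
    order = np.argsort(eigvals.real)[::-1]   # largest eigenvalues first
    return eigvecs[:, order[: c - 1]].real

# Usage: y = fisher_projection(S_W, S_B, c).T @ x projects a sample x
# into the (c-1)-dimensional space of (3.11).
\end{verbatim}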

Possible Issues

The main issue here is that $\mathbf{S_W}$ can be singular (so that $\mathbf{S^{-1}_W}$ does not exist) when we have high-dimensional data (which is usually the case for images) and not enough training data. E.g., in our case we have 32 SIFT tiles, each with 9 bins, which results in a vector of dimension 288. Thus $\mathbf{S_W}$ and $\mathbf{S_B}$ are $288 \times 288$ matrices. In the ideal case we would need $288^2$ training pictures [6] so that $\mathbf{S_W}$ is not degenerate.

A possible solution is to use Principal Component Analysis (PCA) first [5,6,7], but the problem is that vectors carrying very important information could be part of the PCA null space3.2 [5,6].
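A hedged sketch of that workaround (the intermediate PCA dimension $k$ is our illustrative choice): project onto the leading $k$ principal components first, so that the within-class scatter computed in the reduced space is far less likely to be singular, and apply FLD there.

\begin{verbatim}
import numpy as np

def pca_basis(X, k):
    """Leading k principal components of the data matrix X (rows = samples)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                          # d x k projection matrix

# Pipeline sketch: reduce the 288-dimensional SIFT vectors to k dimensions,
# compute S_W and S_B there, solve (3.13) in that k-dimensional space, and
# apply the product of the two projection matrices to test samples. The risk
# noted above remains: discriminative directions may lie in the discarded
# PCA null space.
\end{verbatim}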

Kocurek 2007-12-17