Modification of the EM algorithm for solving the clustering problem with outliers

The main problem with cluster analysis of real-world data is the presence of outliers. Most existing clustering methods do not take them into account, so clearly anomalous observations end up inside some of the clusters, which can seriously shift their centers and degrade the quality of the classification. Of course, one could first screen the data for outliers, filter them out and so on, but then the task becomes a two-stage one, while we would like to do it all at once.





One approach to this problem (making the clustering method filter out outliers automatically), called the "optimally tuned robust improper maximum likelihood estimator" (OTRIMLE), was described in a 2017 article ( http://dx.doi.org/10.1080/01621459.2015.1100996 ) and has recently received an implementation in R. Let's talk about it.





A moment of theory

In short, clustering with the EM algorithm is based on the assumption that individual clusters have a shape close to that of a multivariate normal distribution. The clustering problem can then be treated as separating the components of a mixture, and cluster membership is determined by the probability of a point belonging to one component or another.
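For reference, this classic model (a Gaussian mixture with no noise component) can be fitted in R with, for example, the mclust package; the snippet below is just an illustration of that baseline approach on a standard dataset, not part of the method discussed here.

library(mclust)
# Classic model-based clustering: a Gaussian mixture fitted with the EM algorithm,
# no special treatment of outliers.
fit <- Mclust(iris[, 1:4], G = 3)        # ask for three mixture components
table(fit$classification, iris$Species)  # compare with the known species labels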





The mathematical upgrade of the classic algorithm is a slightly different model assumption: the density of the data points is taken to be a mixture of Gaussians plus a uniform component for the noise points (the outliers are assumed to be spread out evenly), that is

$$\psi_\delta(x;\theta) = \pi_0\,\delta + \sum_{j=1}^{G}\pi_j\,\varphi(x;\mu_j,\Sigma_j),$$

where $\delta$ is the improper constant density (the noise density), $\pi_0, \pi_1, \dots, \pi_G$ are the mixture proportions, and $\varphi(x;\mu_j,\Sigma_j)$ is the density of a multivariate normal distribution with mean $\mu_j$ and covariance matrix $\Sigma_j$.
This somewhat complicates the optimization problem, which now amounts to maximizing the improper log-likelihood

$$\ell_\delta(\theta) = \sum_{i=1}^{n}\log\psi_\delta(x_i;\theta) \;\to\; \max_{\theta}$$
but under the following restrictions:

$$\pi_0 + \sum_{j=1}^{G}\pi_j = 1, \qquad \pi_j \ge 0, \qquad \frac{\lambda_{\max}(\theta)}{\lambda_{\min}(\theta)} \le \gamma, \qquad \pi_0 \le \pi_{\max},$$

where $\lambda_{\max}(\theta)$ and $\lambda_{\min}(\theta)$ are the largest and smallest eigenvalues over all cluster covariance matrices, $\gamma$ bounds the eigenvalue ratio and $\pi_{\max}$ bounds the noise proportion (these correspond to the erc and npr.max parameters used below).
This optimization problem (especially with the condition on the ratio of eigenvalues) does not always have a solution. The authors are honest about this, but they also claim that their algorithm copes quite easily with clusters of rather complex shape. Let's check.
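To illustrate what the eigenvalue restriction measures, here is a tiny helper (purely illustrative, not part of the package) that computes the ratio of the largest to the smallest eigenvalue across a set of covariance matrices; the algorithm keeps this ratio below a chosen bound (the erc parameter used later):

# ratio of the largest to the smallest eigenvalue across all cluster covariance matrices
eigen_ratio <- function(cov_list) {
  ev <- unlist(lapply(cov_list, function(S) eigen(S, only.values = TRUE)$values))
  max(ev) / min(ev)
}

eigen_ratio(list(diag(2), matrix(c(4, 0, 0, 0.1), 2, 2)))  # 40 - would violate a bound of 20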





Experiment

First, let's generate test data: three clusters of different shapes plus uniformly scattered noise points.





library(ggplot2)
# generate test data: three clusters of different shapes plus uniform noise
set.seed(739)
n <- 500 # number of points per cluster
y <- abs(c(rnorm(n, 5, 2)))
x <- c(rnorm(n, 5, 1)) / sqrt(y)
class <- rep(1, n)
dat1 <- as.data.frame(cbind(x, y, class)) # first cluster - a curved, banana-like shape
y <- c(rnorm(n, 8, 0.4))
x <- rlnorm(n, meanlog = log(7), sdlog = 0.2)
class <- rep(2, n)
dat2 <- as.data.frame(cbind(x, y, class)) # second cluster - skewed along x (lognormal)
y <- c(runif(n, min = 2, max = 7))
x <- c(rnorm(n, 8, 0.4))
class <- rep(3, n)
dat3 <- as.data.frame(cbind(x, y, class)) # third cluster - a narrow vertical band
y <- c(runif(n/10, min = 0, max = 10))
x <- c(runif(n/10, min = 0, max = 10))
class <- rep(0, n/10)
dat4 <- as.data.frame(cbind(x, y, class)) # uniformly scattered noise
dat <- rbind(dat1, dat2, dat3, dat4)
colnames(dat) <- c("x", "y", "class")
dat$class <- as.factor(dat$class)
dat_test <- as.data.frame(dat)
p_base <- ggplot(dat, aes(x = x, y = y, color = class)) + geom_point()
ggExtra::ggMarginal(p_base, groupColour = TRUE, groupFill = TRUE)
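As a quick sanity check, the class sizes of the generated data:

table(dat$class)  # counts per class in the generated data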
      
      







Hereinafter, the label "0" denotes observations defined as noise

Let's feed the data to the otrimleg function from the otrimle package, which fits the model for several candidate numbers of clusters at once; we will try from 2 to 5 clusters.





library(otrimle)
clus <- otrimleg(dat[,c(1,2)], G=2:5, monitor=1) # monitor controls how much progress information is printed while fitting
      
      








The function returns a number of diagnostics for each candidate number of clusters, including the improper log-likelihood iloglik. Judging by them, the solutions with 4 and 5 clusters describe the data noticeably better than the one with 3.
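For example, assuming (as the name used above suggests) that the returned list has an iloglik component, the comparison can be done like this:

clus$iloglik  # improper log-likelihood for each candidate number of clusters (2 to 5 here)
plot(2:5, clus$iloglik, type = "b",
     xlab = "number of clusters", ylab = "improper log-likelihood")

Let's also look at the solutions themselves: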





clus$solution
      
      







The Noise row in this output corresponds to the points classified as outliers. Now let's fit the model with the otrimle function for a fixed number of clusters (let's take 5) and set a couple of its parameters explicitly.





clus_2 <- otrimle(dat[,c(1,2)], G = 5, npr.max = 0.01, erc = 20, monitor = TRUE)
# npr.max - upper bound on the proportion of points assigned to noise
# erc - upper bound on the ratio of the largest to the smallest eigenvalue of the cluster covariances
clus_2$code
clus_2$flag
      
      



If clus_2$code equals 0, it means that no acceptable solution could be found; clus_2$code = 1 means that the EM-type iterations did not converge within the allowed number of iterations (the returned solution is only approximate); clus_2$code = 2 means that the algorithm converged.





clus_2$flag reports possible problems with the obtained solution.





clus_2$flag = 1 means that it was not possible to prevent numerical degeneracy of the improper posterior probabilities: some of them collapse to 0.





The remaining values (clus_2$flag = 2, clus_2$flag = 3, clus_2$flag = 4) signal other situations, for example that one of the constraints we imposed (npr.max or erc) had to be actively enforced during the iterations; their exact meaning is spelled out in the package documentation. In our case (clus_2$code = 2, clus_2$flag = 4) the algorithm converged, so the solution can be used.
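Before using the result, it is sensible to check these indicators programmatically; a trivial sketch:

# stop if no solution was found, warn if the EM iterations converged only approximately
if (clus_2$code == 0) stop("otrimle could not find an acceptable solution")
if (clus_2$code == 1) warning("EM iterations did not fully converge; the solution is approximate")

The main outputs of interest are the cluster centers, the posterior probabilities and the final cluster labels: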





clus_2$mean # cluster centers (one column per cluster)
head(clus_2$tau) # improper posterior probabilities of belonging to the noise component and to each cluster
head(clus_2$cluster) # hard cluster assignments, 0 = noise
      
      



Finally, let's compare the partitions produced by the algorithm with the true one.
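A comparison like the ones below can be drawn, for example, as follows (gridExtra is assumed here only to place two ggplot objects side by side; this is a sketch, not the exact code behind the original pictures):

library(gridExtra)

dat$fitted <- as.factor(clus_2$cluster)  # cluster labels from otrimle, 0 = noise

p_true <- ggplot(dat, aes(x = x, y = y, color = class))  + geom_point() + ggtitle("True partition")
p_fit  <- ggplot(dat, aes(x = x, y = y, color = fitted)) + geom_point() + ggtitle("otrimle, G = 5")
grid.arrange(p_true, p_fit, ncol = 2)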





Left - true partition, right - proposed by the algorithm for the case of 5 clusters
Left - true partition, right - proposed by the algorithm for the case of 4 clusters
Left - true partition, right - proposed by the algorithm for the case of 3 clusters
Left - true partition, right - proposed by the algorithm for the case of 2 clusters

It is worth noting that even if one guesses the number of clusters correctly (3), the resulting clustering is in fact quite different from the true one - a more sensible picture is given by partitions with more clusters than there really are. This is caused by the non-elliptical shape of the clusters; on the whole, though, the algorithm copes well even in the presence of noise.







