This article was published as a part of the Data Science Blogathon.
Hierarchical clustering is one of the best-known clustering techniques in unsupervised machine learning. Along with K-means, it is among the most popular and effective clustering algorithms, and the way each algorithm works internally explains where it performs well.
In this article, we will discuss hierarchical clustering and its types, its working mechanism, its basic intuition, and the pros and cons of this clustering strategy, and end with some key points to remember. Knowing these concepts will help you understand how the algorithm works and answer interview questions about hierarchical clustering more effectively.
Hierarchical clustering is an unsupervised machine-learning clustering strategy. Unlike K-means clustering, it groups the data into a tree-like structure, and a dendrogram is used to visualize the resulting hierarchy of clusters.
Here, the dendrogram is the tree-like representation of the dataset, in which the X axis represents the data observations and the Y axis represents the distance (Euclidean, in this example) at which observations or clusters are merged.
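import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

# `data` is the numeric feature matrix, one row per observation
plt.figure(figsize=(11, 7))
plt.title("Dendrogram")
dendrogram = shc.dendrogram(shc.linkage(data, method="ward"))
plt.show()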
Typical dendrograms look like this:
Hierarchical Clustering Types
There are two types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive Clustering
In agglomerative clustering, each data observation starts out as its own cluster. In every iteration, the distances between the clusters are computed and the most similar clusters are merged. The distances are then recalculated in the next iteration, and the closest clusters are merged again. The process continues until all observations belong to a single cluster or a stopping criterion, such as a target number of clusters, is reached.
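The bottom-up merging process described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `agglomerate`, the toy points, and the choice of single linkage (the distance between two clusters is the distance between their closest members) are all assumptions made for the example.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering (illustrative only)."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all observations.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, then drop b (b > a, so a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical toy data: two tight groups of 2-D points.
toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]]
result = agglomerate(toy, n_clusters=2)
print(result)
```

The nested search over all cluster pairs is what makes the naive approach slow; library implementations such as `scipy.cluster.hierarchy.linkage` use more efficient update rules.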
Divisive clustering is the opposite of agglomerative clustering. The whole dataset starts as a single cluster. In each iteration, based on the Euclidean distances between the data observations, a cluster is split into smaller groups, hence the name "divisive". This process continues until the desired number of clusters is reached.
Scikit-learn does not provide a DIANA (DIvisive ANAlysis) implementation; however, we can implement it manually using the code below:
# Importing required libraries
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Creating a DIANA (DIvisive ANAlysis) class
class DianaClustering:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.n_samples, self.n_features = self.data.shape

    def fit(self, n_clusters):
        # Pairwise Euclidean distances between all observations
        similarity_matrix = distance_matrix(self.data, self.data)
        # Start with a single cluster that holds every observation
        clusters = [list(range(self.n_samples))]
        while len(clusters) < n_clusters:
            # Pick the cluster with the largest diameter to split next
            c_diameters = [np.max(similarity_matrix[c][:, c]) for c in clusters]
            mcd = np.argmax(c_diameters)
            # Its most dissimilar member seeds the splinter group
            mdi = np.argmax(np.mean(similarity_matrix[clusters[mcd]][:, clusters[mcd]], axis=1))
            splinters = [clusters[mcd][mdi]]
            last_clusters = clusters[mcd]
            del last_clusters[mdi]
            # Keep moving points to the splinter group while any is closer to it
            while len(last_clusters) > 1:
                split = False
                for j in reversed(range(len(last_clusters))):
                    splinter_distances = similarity_matrix[last_clusters[j], splinters]
                    last_distances = similarity_matrix[last_clusters[j], np.delete(last_clusters, j)]
                    if np.mean(splinter_distances) <= np.mean(last_distances):
                        splinters.append(last_clusters[j])
                        del last_clusters[j]
                        split = True
                        break
                if not split:
                    break
            # Replace the split cluster with its two parts
            del clusters[mcd]
            clusters.append(splinters)
            clusters.append(last_clusters)
        # Convert the list of clusters into one label per observation
        cluster_labels = np.zeros(self.n_samples, dtype=int)
        for i, members in enumerate(clusters):
            cluster_labels[members] = i
        return cluster_labels
Run the code below with your data:
if __name__ == '__main__':
    data = pd.read_csv('thedata.csv')
    # Drop the non-numeric columns before clustering
    data = data.drop(columns="name")
    data = data.drop(columns="class")
    diana = DianaClustering(data)
    clusters = diana.fit(3)
    print(clusters)
Loss Function in Clustering
In most clustering techniques, the silhouette score can be used to measure the loss, that is, how well or badly a particular clustering algorithm has grouped the data. We calculate the silhouette score using two quantities: cohesion and separation.
Cohesion (a) is the mean distance between an observation and the other observations in its own cluster, and separation (b) is the mean distance between that observation and the observations in the nearest other cluster. We calculate a and b for each observation in the dataset.
The formula for the silhouette score of an observation is: s = (b - a) / max(a, b). The score ranges from -1 to 1, and values close to 1 indicate that the observation sits well inside its cluster.
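As a rough sketch of how the silhouette score can be computed, the snippet below takes a as the mean distance from a point to the other members of its own cluster and b as the mean distance to the members of the nearest other cluster, and scores each point as (b - a) / max(a, b). The function name `silhouette_per_sample` and the toy data are assumptions for illustration, and the code expects at least two clusters.

```python
import numpy as np

def silhouette_per_sample(points, labels):
    """Per-observation silhouette scores (needs at least two clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all observations.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: score stays 0 by convention
        a = dist[i, same].mean()  # cohesion
        b = min(  # separation: nearest other cluster
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two well-separated groups.
toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]]
good = silhouette_per_sample(toy, [0, 0, 0, 1, 1, 1]).mean()
bad = silhouette_per_sample(toy, [0, 1, 0, 1, 0, 1]).mean()
print(good, bad)
```

A labelling that matches the two groups scores close to 1, while a labelling that mixes them scores much lower. In practice, `sklearn.metrics.silhouette_score` computes the mean score directly.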
Hierarchical Clustering vs KMeans
The first difference between K-means and hierarchical clustering is that in K-means the number of clusters is predefined and denoted by "K", whereas in hierarchical clustering it does not have to be fixed in advance: the algorithm starts from either a single cluster (divisive) or as many clusters as there are data observations (agglomerative).
Another difference between the two techniques is that K-means clustering is more efficient than hierarchical clustering on very large datasets, while hierarchical clustering works well on small datasets.
K-means clustering is also more effective than hierarchical clustering when the clusters are spherical in shape.
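To illustrate the point about the number of clusters, the sketch below builds the hierarchy once with SciPy and then cuts it at several different cluster counts without refitting. The synthetic blobs, the random seed, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): three blobs in 2-D.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Build the full merge tree once; no cluster count is needed here.
tree = linkage(blobs, method="ward")

# Cut the same tree at different cluster counts.
for k in (2, 3, 4):
    labels = fcluster(tree, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])

labels_3 = fcluster(tree, t=3, criterion="maxclust")
```

With K-means, each value of K would require a separate run of the algorithm; here the expensive step (`linkage`) happens once and each cut is cheap.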
Advantages of Hierarchical Clustering
1. Performance:
It is effective at revealing the structure in the data and gives accurate results. Unlike K-means clustering, its good performance is not limited to spherical clusters; data of any shape is acceptable for hierarchical clustering.
2. Ease of Use:
It is easy to use and has good community support. Plenty of content and good documentation are available, which makes for a better user experience.
3. More Approaches:
There are two approaches with which the data can be clustered: agglomerative and divisive. So if the dataset provided is complex and one approach does not give good results, we can use the other.
4. Performance on Small Datasets:
Hierarchical clustering algorithms are effective on small datasets and give accurate and reliable results with short training and testing times.
Disadvantages of Hierarchical Clustering
1. Time Complexity:
Hierarchical clustering involves many iterations and pairwise distance computations, so its time complexity is high: at least quadratic in the number of observations for standard agglomerative algorithms. In some cases, this is one of the main reasons for preferring K-means clustering.
2. Space Complexity:
Since the pairwise distances between observations have to be computed and stored, the space complexity of the algorithm is very high. Because of this, the memory footprint of the model has to be considered when implementing hierarchical clustering. When memory is limited, we prefer K-means clustering.
3. Poor Performance on Large Datasets:
When a hierarchical clustering algorithm is trained on a large dataset, training takes so much time and memory that the algorithm performs poorly.
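To get a feel for why memory becomes a problem on large datasets, the back-of-the-envelope sketch below estimates the size of a pairwise distance table. It assumes that one distance is stored per unordered pair of observations as a 64-bit float (the condensed form that `scipy.spatial.distance.pdist` returns); the helper name `condensed_distance_bytes` is hypothetical.

```python
def condensed_distance_bytes(n_samples, bytes_per_value=8):
    """Bytes needed to store one distance per unordered pair of samples."""
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * bytes_per_value

# The cost grows quadratically with the number of observations.
for n in (1_000, 10_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>7} samples -> {gib:.3f} GiB")
```

Under these assumptions, 100,000 observations already need roughly 37 GiB for the distances alone, whereas K-means only has to keep the data and the K centroids in memory.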
Conclusion
This article discussed the hierarchical clustering algorithm, its basic intuition, and its working mechanism. Knowing these concepts will help readers understand the algorithm better and answer hierarchical clustering questions in data science interviews more efficiently.
Some of the key takeaways from this article are as follows:
1. Hierarchical clustering is effective on small datasets but performs poorly on large datasets, and K-means is the better choice for spherical clusters.
2. Since hierarchical clustering involves many distance computations, its space and time complexity is very high.
3. Dendrograms are an integral part of hierarchical clustering, and the silhouette score is used to measure the error of the clustering algorithm.
The media shown in this article is not owned by Analytics Vidhya and is used at the sole discretion of the author.