A: The first step of cluster analysis is usually to choose the analysis method, which depends on the size of the dataset and the types of variables. Hierarchical clustering, for example, is appropriate for small datasets, while K-means clustering is better suited to moderately large datasets and to situations where the number of clusters is known in advance. Large datasets, or datasets with a mixture of different variable types, generally require a two-step procedure.
After you decide which method to use, start by choosing the cases to subdivide into homogeneous groups, or clusters. Those cases, or observations, can be any subject, person, or thing you want to analyze.
Next, choose the variables to include. There could be 1,000 variables, or even 10,000 or 25,000. The number and types of variables chosen determine which algorithm should be used. Then decide whether to standardize those variables so that each one contributes equally to the distance or similarity between cases. Standardization is optional: the analysis can be run on either standardized or unstandardized variables.
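To illustrate why standardization matters, here is a minimal Python sketch. The cases (income in dollars, age in years) are made-up values, and z-scoring with the population standard deviation is only one common choice, assumed here for illustration.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column: subtract its mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical cases: (income in dollars, age in years)
cases = [(30000, 25), (31000, 60), (60000, 26)]

# Unstandardized: income dominates because its scale is far larger than age's.
raw_d01 = euclidean(cases[0], cases[1])  # similar income, very different age
raw_d02 = euclidean(cases[0], cases[2])  # very different income, similar age

# Standardized: both variables contribute on a comparable scale.
z = standardize(cases)
z_d01 = euclidean(z[0], z[1])
z_d02 = euclidean(z[0], z[2])

print(raw_d01, raw_d02)
print(z_d01, z_d02)
```

On the raw values, the 35-year age gap between the first two cases barely registers next to the income differences. After standardization, the age gap and the income gap carry similar weight.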
Each analysis method takes a different approach. For K-means clustering, select the number of clusters; the algorithm then iteratively estimates the cluster means and assigns each case to the cluster whose mean is closest. For hierarchical clustering, first choose a statistic that quantifies how far apart or how similar two cases are. Next, select a method for forming the groups. Finally, determine how many clusters are needed to represent the data, typically by looking at how similar the clusters are at each step as they are merged or split.
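The K-means loop can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the data points are made up, and using the first k points as the starting means is an assumption made so the example is deterministic (real implementations usually pick starting means at random or with a smarter scheme).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    # Assumption for this sketch: the first k points are the initial cluster means.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case goes to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means are stable
        labels = new_labels
        # Update step: recompute each cluster mean from its assigned cases.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old mean if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical two-variable cases forming two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The number of clusters, k, has to be supplied up front, which is why K-means fits best when that number is known in advance. Results can also depend on the starting means, so running the procedure from several starting points is a common precaution.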