Normalization is not always required, but it rarely hurts.
Some examples:
K-means:
K-means clustering is "isotropic" in all directions of space and
therefore tends to produce more or less round (rather than elongated)
clusters. Leaving the variances unequal is equivalent to weighting the
variables unequally: variables with larger variance dominate the
distance, so the clusters tend to be separated along those variables.
Example in MATLAB (requires the Statistics Toolbox):
% Two well-separated Gaussian groups on comparable scales
X = [randn(100,2)+ones(100,2); ...
     randn(100,2)-ones(100,2)];
% Uncomment the next line to put the second variable on a much larger
% scale; the distance is then dominated by that variable alone:
% X(:,2) = X(:,2) * 1000 + 500;
opts = statset('Display','final');
[idx,ctrs] = kmeans(X,2, ...
                    'Distance','city', ...
                    'Replicates',5, ...
                    'Options',opts);
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(ctrs(:,1),ctrs(:,2),'kx', ...
     'MarkerSize',12,'LineWidth',2)
plot(ctrs(:,1),ctrs(:,2),'ko', ...
     'MarkerSize',12,'LineWidth',2)
legend('Cluster 1','Cluster 2','Centroids', ...
       'Location','NW')
title('K-means with normalization')
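As a quick follow-up sketch (my addition, not part of the original example; it assumes the Statistics Toolbox function zscore): if you deliberately blow up the scale of the second variable, k-means on the raw data effectively ignores the first variable, while z-scoring the columns first lets both variables contribute again.

Xbad = X;
Xbad(:,2) = Xbad(:,2) * 1000 + 500;                          % second variable now dominates the distance
idxRaw = kmeans(Xbad,         2, 'Distance','city', 'Replicates',5);
idxStd = kmeans(zscore(Xbad), 2, 'Distance','city', 'Replicates',5);
% idxStd uses both variables as before; idxRaw is determined almost
% entirely by column 2, since column 1 barely contributes to the distance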
(FYI: see also the related question "How can I detect if my dataset is clustered or unclustered (i.e. forming one single cluster)?")
Distributed clustering:
The comparative analysis shows that the distributed clustering results
depend on the type of normalization procedure.
Artificial neural network (inputs):
If the input variables are combined linearly, as in an MLP, then it is
rarely strictly necessary to standardize the inputs, at least in
theory. The reason is that any rescaling of an input vector can be
effectively undone by changing the corresponding weights and biases,
leaving you with the exact same outputs as you had before. However,
there are a variety of practical reasons why standardizing the inputs
can make training faster and reduce the chances of getting stuck in
local optima. Also, weight decay and Bayesian estimation can be done
more conveniently with standardized inputs.
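To make the "rescaling can be undone by the weights and biases" argument concrete, here is a small sketch of my own (not from the quoted FAQ) for a single linear unit: standardizing the inputs and adjusting the weights and bias accordingly yields exactly the same pre-activation.

x  = randn(5,1);                      % one input vector
w  = randn(5,1);  b = 0.3;            % original weights and bias
mu = randn(5,1);                      % arbitrary per-input shifts
sg = [1; 10; 100; 0.5; 2];            % arbitrary per-input scales
z1 = w'*x + b;                                   % unit applied to the raw inputs
z2 = (w.*sg)'*((x - mu)./sg) + (b + w'*mu);      % adjusted unit on standardized inputs
abs(z1 - z2)                                     % zero up to rounding error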
Artificial neural network (inputs/outputs):
Should you do any of these things to your data? The answer is, it
depends.
Standardizing either input or target variables tends to make the training
process better behaved by improving the numerical condition (see
ftp://ftp.sas.com/pub/neural/illcond/illcond.html) of the optimization
problem and ensuring that various default values involved in
initialization and termination are appropriate. Standardizing targets
can also affect the objective function.
Standardization of cases (i.e. rows/observations, as opposed to variables) should be approached with caution because it
discards information. If that information is irrelevant, then
standardizing cases can be quite helpful. If that information is
important, then standardizing cases can be disastrous.
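A minimal sketch of the bookkeeping this implies for input/target standardization (my own illustration; an ordinary least-squares fit stands in for the network, and zscore is from the Statistics Toolbox): standardize inputs and targets before fitting, then map the predictions back to the original target units.

rng(1);
X = randn(200,3) * diag([1 100 0.01]);            % inputs on wildly different scales
y = 5*X(:,1) + 0.02*X(:,2) + 300*X(:,3) + randn(200,1);
[Xs, muX, sgX] = zscore(X);                       % standardized inputs (keep muX, sgX for new data)
[ys, muY, sgY] = zscore(y);                       % standardized targets
w = [ones(200,1) Xs] \ ys;                        % "training" stand-in: linear least squares
yhat = ([ones(200,1) Xs]*w) * sgY + muY;          % predictions mapped back to the original units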
Interestingly, changing the measurement units may even lead one to see a very different clustering structure: Kaufman, Leonard, and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 2005.
In some applications, changing the measurement units may even lead one
to see a very different clustering structure. For example, the age (in
years) and height (in centimeters) of four imaginary people are given
in Table 3 and plotted in Figure 3. It appears that {A, B} and {C, D}
are two well-separated clusters. On the other hand, when height is
expressed in feet one obtains Table 4 and Figure 4, where the obvious
clusters are now {A, C} and {B, D}. This partition is completely
different from the first because each subject has received another
companion. (Figure 4 would have been flattened even more if age had
been measured in days.)
To avoid this dependence on the choice of measurement units, one has
the option of standardizing the data. This converts the original
measurements to unitless variables.
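To see this effect numerically, here is a small sketch in the spirit of that example (the numbers are illustrative, not the book's actual Table 3 values):

age      = [35; 40; 35; 40];              % years, persons A-D
heightCm = [190; 190; 160; 160];          % centimeters
heightFt = heightCm / 30.48;              % the same heights in feet
idxCm = kmeans([age heightCm], 2, 'Replicates',5);   % height dominates: {A,B} vs {C,D}
idxFt = kmeans([age heightFt], 2, 'Replicates',5);   % age dominates:    {A,C} vs {B,D}
[idxCm idxFt]                             % the two partitions disagree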
Kaufman and Rousseeuw continue with some interesting considerations (page 11):
From a philosophical point of view, standardization does not really
solve the problem. Indeed, the choice of measurement units gives rise
to relative weights of the variables. Expressing a variable in smaller
units will lead to a larger range for that variable, which will then
have a large effect on the resulting structure. On the other hand, by
standardizing one attempts to give all variables an equal weight, in
the hope of achieving objectivity. As such, it may be used by a
practitioner who possesses no prior knowledge. However, it may well be
that some variables are intrinsically more important than others in a
particular application, and then the assignment of weights should be
based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985). On
the other hand, there have been attempts to devise clustering
techniques that are independent of the scale of the variables
(Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is
to search for a partition that minimizes the total volume of the
convex hulls of the clusters. In principle such a method is invariant
with respect to linear transformations of the data, but unfortunately
no algorithm exists for its implementation (except for an
approximation that is restricted to two dimensions). Therefore, the
dilemma of standardization appears unavoidable at present and the
programs described in this book leave the choice up to the user.