Actually, with certain signal characteristics, upsampling with no other DSP operations is entirely sufficient to cause digital clipping.
Imagine a single tone of amplitude A sampled four times per cycle, i.e. at intervals of 1/4 of its period.
If you are lucky, you may sample it at 0, 90, 180, and 270 degrees of phase, resulting in values of 0, A, 0 and -A respectively.
However, you could just as well be sampling at 45, 135, 225 and 315 degrees of phase. In this case you will get sample values of approximately 0.7A, 0.7A, -0.7A and -0.7A (the exact value is SQRT(0.5) x A).
Now imagine 0.7A is the maximum value that your encoding scheme can represent (32767 for 16 bit samples). Then -0.7A is -32767, which is very nearly (but not quite) the minimum (most negative) value that your encoding scheme can represent.
Now if we resample this wave to double the sampling frequency, we effectively generate samples at 0, 45, 90, 135, 180, 225, 270, and 315 degrees of phase. Thus the values that we want to represent are 0, 0.7A, A, 0.7A, 0, -0.7A, -A and -0.7A.
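If you want to check the numbers yourself, here is a minimal Python sketch of the example above. It assumes an ideal resampler, so it simply evaluates the underlying sine at the new phases; a real resampler uses an interpolation filter and will only approximate these values. The name sample_tone is made up for this illustration.

```python
import math

A = 1.0  # tone amplitude in arbitrary units (assumption for illustration)

def sample_tone(phases_deg, amplitude=A):
    """Return amplitude * sin(phase) for each phase given in degrees."""
    return [amplitude * math.sin(math.radians(p)) for p in phases_deg]

# Original stream: 4 samples per cycle, taken at the "unlucky" 45 degree offset.
original = sample_tone([45, 135, 225, 315])

# Ideal 2x upsampling: 8 samples per cycle, every 45 degrees starting at 0.
upsampled = sample_tone(range(0, 360, 45))

peak_original = max(abs(v) for v in original)
peak_upsampled = max(abs(v) for v in upsampled)

print("original :", [round(v, 3) for v in original])
print("upsampled:", [round(v, 3) for v in upsampled])
print("peak ratio:", peak_upsampled / peak_original)
```

The peak ratio comes out at SQRT(2), which is the growth in peak sample value caused by the upsampling alone.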
However, since 0.7A is the maximum that can be represented and -0.7A is very nearly the minimum that can be represented, we have no way to represent A or -A, so clipping occurs.
With our 16 bit samples, if 0.7A corresponds to 32767, then the value A obtained during resampling would, if it could be stored, be a sample value of 46340 (32767 x SQRT(2), rounded to the nearest integer). This is clearly outside the range that can be represented by a signed 16 bit value (-32768 to 32767), so it is clipped.
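The same thing with 16 bit numbers, as a rough Python sketch. It assumes the resampler saturates (hard clips) out-of-range values rather than wrapping them around, and clip_int16 is a hypothetical helper written for this post, not a function from any library.

```python
import math

INT16_MAX = 32767
INT16_MIN = -32768

def clip_int16(x):
    """Round to the nearest integer and saturate to the signed 16 bit range."""
    return max(INT16_MIN, min(INT16_MAX, int(round(x))))

# The original samples (0.7A) sit exactly at positive full scale.
full_scale = INT16_MAX
amplitude = full_scale * math.sqrt(2)  # the true peak A of the tone

# Values an ideal 2x resampler would want to write, every 45 degrees.
ideal = [amplitude * math.sin(math.radians(p)) for p in range(0, 360, 45)]
stored = [clip_int16(v) for v in ideal]

for phase, want, got in zip(range(0, 360, 45), ideal, stored):
    flag = "CLIPPED" if round(want) != got else ""
    print(f"{phase:3d} deg  wanted {round(want):6d}  stored {got:6d}  {flag}")
```

Only the samples at 90 and 270 degrees get flagged, since those are the ones that need the full amplitude A.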
In decibel terms, the difference between a signal of amplitude A (which can't be represented) and one of 0.7A (which can be represented) is 20LOG10(A/0.7A), or 20LOG10(SQRT(2)), which is just slightly more than 3dB. That is why a headroom value of 3dB was employed by @denydog in post 5 in this thread.
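For completeness, the headroom figure can be worked out in a couple of lines of Python. This is only a sketch of the arithmetic for this one worst-case example; headroom_db is a name invented here, and other signals may need a different amount of headroom.

```python
import math

def headroom_db(true_peak, sample_peak):
    """Level difference in dB between the true peak and the largest stored sample."""
    return 20 * math.log10(true_peak / sample_peak)

# Tone of amplitude A = 1 whose largest original sample is SQRT(0.5) x A.
needed = headroom_db(1.0, math.sqrt(0.5))

# Linear gain to apply before upsampling so that A lands exactly at full scale.
gain = 10 ** (-needed / 20)

print(f"headroom needed: {needed:.4f} dB")
print(f"pre-gain factor: {gain:.4f}")
```

The pre-gain factor is just SQRT(0.5) again: attenuating by the headroom amount before upsampling brings the true peak A back down to full scale.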