The other observation is that precision is almost always higher in the accumulation step than in the multiplication step. This is specific to AI chips. You're multiplying low-precision numbers, but when you sum the products, rounding errors compound quickly over many additions, so the accumulator needs more precision than the multiplier does.
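To make that concrete, here is a minimal sketch of a dot product with a low-precision multiply and a configurable accumulator. It assumes NumPy's `float16` and `float32` as stand-ins for whatever formats a given chip actually uses, and the helper name `dot_with_accumulator` is made up for illustration. Real hardware typically keeps the product at full width before accumulating; here the product is rounded to `float16` just to keep the example small.

```python
import numpy as np


def dot_with_accumulator(a, b, acc_dtype):
    """Dot product: float16 multiplies, accumulation in acc_dtype.

    Hypothetical helper for illustration only, not any chip's actual datapath.
    """
    acc = acc_dtype(0.0)
    for x, y in zip(a, b):
        # Multiply step: both operands and the product are float16.
        product = np.float16(x) * np.float16(y)
        # Accumulate step: the running sum is rounded to acc_dtype each time.
        acc = acc_dtype(acc + acc_dtype(product))
    return acc


n = 4096
a = np.ones(n, dtype=np.float16)
b = np.ones(n, dtype=np.float16)

low = dot_with_accumulator(a, b, np.float16)   # float16 accumulator
high = dot_with_accumulator(a, b, np.float32)  # float32 accumulator

print(low, high)  # 2048.0 4096.0
```

The exact answer is 4096. The `float16` accumulator stalls at 2048, because above that value the gap between neighbouring `float16` numbers is 2, so adding 1 rounds straight back to where it was. Every individual multiply was exact; all of the error came from the accumulation, which is why the wider accumulator fixes it.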