From Agner Fog's Instruction Tables:
On Core 2 (65 nm), FSQRT takes 9 to 69 clock cycles (with almost equal reciprocal throughput), depending on the operand value and the precision bits. For comparison, FDIV takes 9 to 38 cycles (with almost equal reciprocal throughput), FMUL takes 5 (reciprocal throughput 2), and FADD takes 3 (reciprocal throughput 1). SSE performance is about equal, but it looks faster because it can't do 80-bit math. SSE does have very fast approximate reciprocal and approximate reciprocal square root instructions, though (RCPSS/RCPPS and RSQRTSS/RSQRTPS).
On Core 2 (45 nm), division and square root got faster: FSQRT takes 6 to 20 clock cycles and FDIV takes 6 to 21, while FADD and FMUL haven't changed. Once again, SSE performance is about the same.
You can get the documents with this information from his website.
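To make the point about the approximate SSE instructions concrete, here is a minimal sketch in C using the standard `<xmmintrin.h>` intrinsics. It assumes an x86 target with SSE enabled; the helper names (`rsqrt_approx`, `rsqrt_refined`, `rcp_approx`, `sqrt_via_rsqrt`) are made up for illustration and are not from the instruction tables. The raw approximations are only good to roughly 12 bits, so one Newton-Raphson step is shown as a common way to recover most of single precision.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ss, _mm_rcp_ss */

/* Approximate 1/sqrt(x) with RSQRTSS (roughly 12 bits of precision). */
static inline float rsqrt_approx(float x)
{
    assert(x > 0.0f);  /* illustrative precondition: positive input only */
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

/* Refine the estimate with one Newton-Raphson step:
   y' = y * (1.5 - 0.5 * x * y * y). */
static inline float rsqrt_refined(float x)
{
    float y = rsqrt_approx(x);
    return y * (1.5f - 0.5f * x * y * y);
}

/* Approximate 1/x with RCPSS (roughly 12 bits of precision). */
static inline float rcp_approx(float x)
{
    assert(x != 0.0f);  /* illustrative precondition: nonzero input only */
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}

/* sqrt(x) computed as x * (1/sqrt(x)), avoiding the sqrt instruction. */
static inline float sqrt_via_rsqrt(float x)
{
    return x * rsqrt_refined(x);
}
```

Whether this actually beats a plain `sqrtf` or a division depends on the CPU and on how much precision you need, so measure before relying on it.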