sine, cosine, tangent…
Precise evaluation, almost by inspection.
Q: What is the easiest, most fundamental way to precisely evaluate triginometric functions…almost by inspection? And how is it all related to square root?
Since: , , … what else?
Q: Would you believe that: = … (exactly) almost by inspection?
Q: How about the inverses: arcsine, arccosine, arctangent?
Do they have something to do with squaring (the inverse of square root)?
An Old Algorithm
In the mid-1950s, programmers were devising the very first math libraries for their new vacuum-tube computers. One possible technique used a clever algorithm to calculate the inverses of a certain class of functions, such as Arccosine. 
Input = Y: -1 ≤ Y ≤ 1
X := 0; z := Y
for i := 1 to n do
if z ≥ 0 then
z := 2z2 – 1
z := 1 – 2z2
X := X + 2-i
Output = X = Arccos(Y)/π to ‘n’ binary digits.
Note how the only operations are add, subtract, shift, square; and no constants except 1 and 2.
Consider the deep relation to the “double-angle formula”: cos 2α = 2 cos2 α – 1
For example, evaluate: Arccos (0.471396737) = 0.0101102 π (n = 6) = 22π / 64
Though unique in its simplicity, some noted that this algorithm loses accuracy for some inputs (when ‘z’ nears ±1.0), and its convergence is relatively slow.  Also, there is no similar way to calculate cosine directly. Thus, this algorithm was eventually abandoned (though occasionally revisited ), and nearly forgotton with the passage of time.
A New Algorithm
In fact, the above algorithm can be reversed to produce the cosine with perfect symbolic accuracy.
Do you see why this works?
Input = Xn: a binary number 0.b1b2b3b4…bn000…
Y := 1.0; b0 := 0; bn+1 := 0
for i := n downto 0 do
Note: ⨁ is bitwise exclusive-OR
Output = Y = cos(π Xn) exactly!
Or symbolically, cos (11 π / 32) =
For example, evaluate: cos (11 π / 32) = cos (0.0101102 π) [ Xn = 11/32 = 0.0101102 (n=5) ]
Try it using the online spreadsheet below.
Square Root Algorithm
During an internship in 1988, my officemate was updating a drawing program when he came across some assembly code for calculating the distance between two points:
To our amazement, the square root routine simply did a few shifts, increments, subtractions in a loop.
(It credited Dr. Dobb’s Journal.) [4, 5]
Square Root Algorithm
Input = Xn: 0 ≤ Xn < 1 (‘n’ binary digits) Y := 0; R := Xn for i := 1 to n step 2 R := 4R Y := 2Y T := R – (2Y + 1) if T >= 0 then
R := T
Y := Y + 1
Output = Y = 2n/2√̅Xn
Remainder = R: Y2 + R = 2n Xn
For example, evaluate: (n = 8)
Y2 + R = 132 + 1 = 170 = 28 * 170/256
The algorithms described above are particularly simple to implement in hardware, with no operation more complex than binary subtraction, and where the potential drawbacks of a software implementation evaporate.
An n-bit square root can be implemented with a 2n-bit input using an n+2 -bit subtractor and four n-bit shift registers (plus control logic).
The square root circuit becomes a cosine circuit by adding another shift register (angle input) and three exclusive-or gates. The first XOR creates the ‘±S’ value, combining two adjacent bits of the shifting angle value. The other two XOR gates implement the ‘±’ operation, doing a serial, 1s-complement negation of the square-root input as it shifts out from the input registers.
The square root then produces: (Where 0 ≤ T < 1, so is trivial to represent in the data lines.)
The cosine circuit can also produce the sine by trivially adding two 1-bit inverter gates as feedback around the root input registers. Then as the circuit begins its final square root iteration to produce the cosine, its input registers contain the cosine-squared. Feeding that back in as the 1s-complement negation produces 1-cos2x = sin2x. Then one additional square root iteration produces: sin(x)
The cosine circuit described above does lose accuracy for certain inputs. In those cases, the internal precision may need to be as high as 2n bits to produce an n-bit cosine. However, the missing bits of accuracy do not vanish. In fact, they are immediately available in the square root remainder register!
To see why, consider with remainder, such that: (exactly)
Now when , , since: and
(In other cases the remainder is not such a good extension of the square root result, although it is always better than extending a 2n-bit intermediate result by simply appending zeros.)
The combined ~2n bits of square root output then feed back into the 2n-bit input registers, as the iterations ultimately produce an n-bit cosine, which is found to be accurate to better than 1.3 in 216.
Combined Sine, Cosine, Square Root
The essence of the combined trigonometric circuit, with optimizations, looks like this:
The full trignometric processor was constructed while I was a student at BYU, and full details have been peer-reviewed and published. [6, 7]
The project included a simple Z80 interface which attached to an 80s-era Timex-Sinclair 2068.
A trigonometry-intensive, 3-D plotting program was implemented in Z80 assembly with both software and hardware evaluation of sine and square root. (In software, the square root used the algorithm above, but the sine used a rough CORDIC implementation.)
Here you can see the the execution, with the hardware evaluation providing an ~11X speedup:
 D. R. Morrison, A Method for Computing Certain Inverse Functions 
 M. V. Wilkes, D. J. Wheeler, Note on “A Method for Computing Certin Inverse Functions” 
 J. H. Wensley, A Class of Non-Analytical Iterative Processes 
 “68000 Restoring Square Root Routine,” Dr. Dobb’s Journal, vol. 10 #5 (May 1985) pp. 118,122
 “Letters: Right to Optimize [Square Root],” Dr. Dobb’s Journal, vol. 11 #8 (Aug. 1986) pp. 12-14,81-85
 R. E. Fowkes, Minimal Hardware Algorithms for Trigonometric Functions [Wescon, Nov. 19-21, 1991]
 R. E. Fowkes, Hardware Efficient Algorithms for Trigonometric Functions [IEEE Trans. Comp., Feb. 1993]
Cosine: Cell by Cell