Keyboard Matrix Scanning and Debouncing

# Intro

Without diving into the “why” part, I wanted to make yet another keyboard with Cherry MX keyswitches (just like everyone else these days), and I ended up deciding to make my own keyboard controller (again like everyone else). After all…

(customary https://xkcd.com/927/ )

Re: HELP WANTED ON ALGORITHMS
Message #19 Posted by Eric Smith on 27 Apr 2013, 12:49 p.m.,
in response to message #18 by Eddie W. Shore

Usually CORDIC isn’t used for logs and exponentials, although it is possible to do so by using hyperbolic CORDIC. CORDIC is in the general class of shift-and-add algorithms, and there is a simpler one for logs and exponentials, first published by Henry Briggs in 1624, only ten years after the invention of logarithms by Napier. I’m not sure whether the HP 9100A (1968) used Briggs’ algorithm, but all of HP’s handheld and handheld-derived calculators from the HP-35 (1972) through the Saturn-based and Saturn-emulating calculators have used it. Recent HP-designed calculators such as the HP 10bII+, 20b, 30b, and 49gII apparently use a math library based on that from Saturn calculators, so they presumably also use Briggs’ algorithm.

There is a good description of Briggs’ algorithm on Jacques Laporte’s “Briggs and the HP35” web page.

Another excellent reference for algorithms for transcendental functions is Elementary Functions: Algorithms and Implementations by Jean-Michel Muller, second edition, which covers many classes of algorithms including shift-and-add (Briggs’, CORDIC, and others), and has especially good coverage of accurate argument range reduction algorithms.

It should be noted that the HP algorithms for sine and cosine use CORDIC, but not in the most basic form normally described. Instead, they compute the tangent (or cotangent), and use an identity to compute the sine (or cosine) from it. This avoids an issue with each CORDIC iteration effectively multiplying the resulting sine and cosine by a scale factor; by using tangent (or cotangent), these scale factors cancel out. In binary CORDIC, the number of CORDIC iterations is fixed, so the result can be multiplied by the inverse of the product of the scale factors, which is constant. In decimal CORDIC as implemented by HP, the number of CORDIC iterations is variable, so it would take more work to compensate for the non-constant product of the scale factors.
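To make the scale-factor point concrete, here is a minimal binary CORDIC sketch in Python. It is not HP's decimal implementation: the function names `cordic_rotate` and `sin_cos_via_tangent` are made up for illustration, floating-point multiplications stand in for the shifts real hardware would use, and the input angle is assumed to lie well inside (-π/2, π/2).

```python
import math

def cordic_rotate(theta, iterations=40):
    """Rotate the vector (1, 0) by theta with binary CORDIC micro-rotations.

    No gain correction is applied, so the returned x and y are both
    scaled by the same factor K = prod(sqrt(1 + 2**(-2k))).
    """
    x, y, z = 1.0, 0.0, theta
    for k in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward z == 0
        x, y = x - d * y * 2.0**-k, y + d * x * 2.0**-k
        z -= d * math.atan(2.0**-k)          # table lookup in real hardware
    return x, y

def sin_cos_via_tangent(theta, iterations=40):
    """Recover sine and cosine from the tangent, so the gain K cancels."""
    x, y = cordic_rotate(theta, iterations)
    t = y / x                                # tan(theta); K divides out
    c = 1.0 / math.sqrt(1.0 + t * t)         # cos > 0 on (-pi/2, pi/2)
    return t * c, c                          # sin = tan * cos
```

Calling `sin_cos_via_tangent(0.5)` should agree closely with `math.sin(0.5)` and `math.cos(0.5)`, even though the raw vector returned by `cordic_rotate` is about 1.6468 times too long. With a fixed iteration count that factor is a constant and could simply be divided out; the tangent route avoids needing to know it at all.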

For specifics of the algorithms used by HP, see:

Edited: 27 Apr 2013, 1:04 p.m.

Mainframes (Hercules, IBM)

https://keisan.casio.com/calculator

(-6.2 -7.6i)!
 ans1 -9.85941053E-11 +3.2965092E-12i

https://medium.com/@ly.lee/build-your-iot-sensor-network-stm32-blue-pill-nrf24l01-esp8266-apache-mynewt-thethings-io-ca7486523f5d

# Build Your IoT Sensor Network — STM32 Blue Pill + nRF24L01 + ESP8266 + Apache Mynewt + thethings.io

May 27 · 18 min read

Law of Thermospatiality: Air-conditioning always feels too warm AND too cold to different individuals in a sizeable room

http://www.xnumber.com/xnumber/cmhistory.htm

## History of Mechanical Calculators

A Brief History of Mechanical Calculators, by James Redin. From the abacus to the electro-mechanical calculators (Parts I, II and III).

Part I: The Age of the Polymaths
Part II: Crossing the 19th Century
Part III: Getting Ready for the 20th Century

http://www.rechenautomat.de/

# Mechanical Calculating Machines (Mechanische Rechenmaschinen)

#### Pictures and essays on the automation of mechanical calculation

“An automaton (Greek) is, in the broader sense, any self-moving mechanical device that is set in motion by means of power concealed inside it (springs, weights, etc.), e.g. clocks, planetaria and the like; in the narrower sense, a mechanical work of art which, by means of an internal mechanism, imitates the activity of living beings, humans (android) or animals, and is usually also modeled on them in form.” (from Meyers Konversationslexikon, 1885-1892)

In the technical literature, the term “Rechenautomat” (automatic calculator) is, or was, mostly used in connection with (electronic) program-controlled computing systems deployed in technical and scientific fields. If one looks at brochures from office-machine manufacturers from the 1930s to the 1970s, however, one finds automatic calculators on offer there too. They were available in various designs: from semi-automatic machines up to “fully electric memory super-automatics”. These are electrically driven but mechanical calculating machines for use in the office and by private individuals. They could carry out a multiplication or division partly or fully automatically, and could often store intermediate results as well. Some of them were even able to compute a root on their own. In this sense, the fully automatic calculating machines are the mechanical forerunners of today's electronic pocket calculators.

The following list is a collection of essays and books on electro-mechanical calculating machines that were produced between 1900 and 1970.

# Using Computer Algebra Systems (CAS) in Mathematics (Využití systémů počítačové algebry (CAS) v matematice)

(best my Marchant from Robert Mařík)

Pi = 3.141592653589793238462643383279502884197169399375105820974…

22/7 = 3.14…

22/7 - Pi = 0.001264489267349618680213759577…

355/113 = 3.141592…

355/113 - Pi = 2.667641…e-7

103993/33102 = 3.141592653…

103993/33102 - Pi = -5.77890…e-10
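The fractions above are continued-fraction convergents of Pi, and the differences can be checked with exact rational arithmetic. The sketch below is only an illustration: the helper name `convergents` is made up, and Pi is taken as the 50-decimal value quoted in this post, which is more than enough precision for the first few convergents.

```python
from fractions import Fraction

# Pi to 50 decimals, used as an exact rational stand-in (an assumption
# that is safe only for the first handful of convergents).
PI = Fraction("3.14159265358979323846264338327950288419716939937510")

def convergents(x, count):
    """Yield up to `count` continued-fraction convergents of the Fraction x."""
    h_prev, h = 0, 1          # numerators   h[-2], h[-1]
    k_prev, k = 1, 0          # denominators k[-2], k[-1]
    for _ in range(count):
        a = x.numerator // x.denominator      # next partial quotient
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        yield Fraction(h, k)
        remainder = x - a
        if remainder == 0:
            break
        x = 1 / remainder

for c in convergents(PI, 5):
    print(f"{c.numerator}/{c.denominator} - Pi = {float(c - PI):.6e}")
```

The first five values produced are 3, 22/7, 333/106, 355/113 and 103993/33102, with the sign of the difference alternating from one convergent to the next.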

\\\\\\\\\\\\\\\\

http://qin.laya.com/tech_projects_approxpi.html

## Fractional Approximations of Pi

After reading the American Scientist article “On the Teeth of Wheels”, which describes the intricate interplay between pure and applied mathematics and how clockmakers independently developed mathematical methods to approximate gear ratios that could not feasibly be made (for example, replacing a ratio of two primes with another gear ratio that is close in value), I decided to write a program to find fractional approximations of Pi. A couple of minutes of searching Google brought me to an existing fractional approximations of pi page, which tried to find the approximations with an iterative approach that turned out to be very slow. However, since the code was open source, I decided to modify it for my own uses. The Stern-Brocot method pretty much changes the cost of calculating each successive value from O(n²) to about O(1).

You can get several versions of the source:
The original modification to use the stern-brocot method
A 64 bit version
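The linked sources are not reproduced here, so the following is only a guess at the general idea rather than the actual program: walk the Stern-Brocot tree toward Pi by repeatedly taking the mediant of the current lower and upper bounds, and record a fraction whenever it is closer than anything recorded so far. The function name `stern_brocot_improvements`, the starting bounds 3/1 and 4/1, and the 50-decimal value of Pi are all assumptions made for this sketch.

```python
from fractions import Fraction

PI = Fraction("3.14159265358979323846264338327950288419716939937510")

def stern_brocot_improvements(target, max_den):
    """Return (numerator, denominator) pairs on the Stern-Brocot path to
    `target` that each beat the best error recorded before them."""
    lo_n, lo_d = int(target), 1        # lower bound: floor(target)/1
    hi_n, hi_d = int(target) + 1, 1    # upper bound: one above it
    best_err = None
    found = []
    while True:
        n, d = lo_n + hi_n, lo_d + hi_d          # mediant of the bounds
        if d > max_den:
            break
        m = Fraction(n, d)
        err = abs(target - m)
        if best_err is None or err < best_err:
            best_err = err
            found.append((n, d))
        if m < target:
            lo_n, lo_d = n, d                    # tighten from below
        elif m > target:
            hi_n, hi_d = n, d                    # tighten from above
        else:
            break                                # exact hit
    return found

for n, d in stern_brocot_improvements(PI, 113):
    print(f"{n}/{d}")
```

Each step costs a couple of additions and one comparison, which is where the speed-up over brute-force searching comes from. Run with `max_den=113`, the sketch lists 7/2, 10/3, 13/4, 16/5, 19/6, 22/7, 179/57 and so on up to 333/106 and 355/113.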

### Fractional Approximation Table

```
pi = 3.14159265358979323846264338327950288419716939937510

Num.                Den. = Result                                   (Accuracy                                 )
-------------         ----------- = --------------------------------         (---------------------------------        )

7/                  2 = 3.50000000000000000000000000000000000000 (-0.35840734641020676153735661672049711581) [ 0]
10/                  3 = 3.33333333333333333333333333333333333333 (-0.19174067974354009487068995005383044914) [ 0]
13/                  4 = 3.25000000000000000000000000000000000000 (-0.10840734641020676153735661672049711581) [ 0]
16/                  5 = 3.20000000000000000000000000000000000000 (-0.05840734641020676153735661672049711581) [ 1]
19/                  6 = 3.16666666666666666666666666666666666666 (-0.02507401307687342820402328338716378247) [ 1]
22/                  7 = 3.14285714285714285714285714285714285714 (-0.00126448926734961868021375957763997295) [ 2]
179/                 57 = 3.14035087719298245614035087719298245614 ( 0.00124177639681078232229250608652042805) [ 2]
201/                 64 = 3.14062500000000000000000000000000000000 ( 0.00096765358979323846264338327950288419) [ 3]
223/                 71 = 3.14084507042253521126760563380281690140 ( 0.00074758316725802719503774947668598279) [ 3]
245/                 78 = 3.14102564102564102564102564102564102564 ( 0.00056701256415221282161774225386185855) [ 3]
267/                 85 = 3.14117647058823529411764705882352941176 ( 0.00041618300155794434499632445597347243) [ 3]
289/                 92 = 3.14130434782608695652173913043478260869 ( 0.00028830576370628194090425284472027550) [ 3]
311/                 99 = 3.14141414141414141414141414141414141414 ( 0.00017851217565182432122924186536147005) [ 3]
333/                106 = 3.14150943396226415094339622641509433962 ( 0.00008321962752908751924715686440854457) [ 4]
355/                113 = 3.14159292035398230088495575221238938053 (-0.00000026676418906242231236893288649634) [ 6]
52163/              16604 = 3.14159238737653577451216574319441098530 ( 0.00000026621325746395047764008509189889) [ 6]
52518/              16717 = 3.14159239097924268708500329006400669976 ( 0.00000026261055055137764009321549618443) [ 6]
52873/              16830 = 3.14159239453357100415923945335710041592 ( 0.00000025905622223430340392992240246827) [ 6]
53228/              16943 = 3.14159239804048869739715516732573924334 ( 0.00000025554930454106548821595376364085) [ 6]
53583/              17056 = 3.14159240150093808630393996247654784240 ( 0.00000025208885515215870342080295504179) [ 6]
53938/              17169 = 3.14159240491583668239268448948686586289 ( 0.00000024867395655606995889379263702130) [ 6]
54293/              17282 = 3.14159240828607800023145469274389538247 ( 0.00000024530371523823118869053560750172) [ 6]
54648/              17395 = 3.14159241161253233687841333716585225639 ( 0.00000024197726090158423004611365062780) [ 6]
55003/              17508 = 3.14159241489604752113319625314142106465 ( 0.00000023869374571732944713013808181954) [ 6]
55358/              17621 = 3.14159241813744963395948016571136711877 ( 0.00000023545234360450316321756813576542) [ 6]
55713/              17734 = 3.14159242133754370136461035299424833652 ( 0.00000023225224953709803303028525454767) [ 6]
56068/              17847 = 3.14159242449711436095702358939877850619 ( 0.00000022909267887750561979388072437800) [ 6]
56423/              17960 = 3.14159242761692650334075723830734966592 ( 0.00000022597286673512188614497215321827) [ 6]
56778/              18073 = 3.14159243069772588944834836496431140375 ( 0.00000022289206734901429501831519148044) [ 6]
57133/              18186 = 3.14159243374023974485868250302430440998 ( 0.00000021984955349360396088025519847421) [ 6]
57488/              18299 = 3.14159243674517733209464998087327176348 ( 0.00000021684461590636799340240623112071) [ 6]
57843/              18412 = 3.14159243971323050184662176841190527916 ( 0.00000021387656273661602161486759760503) [ 6]
58198/              18525 = 3.14159244264507422402159244264507422402 ( 0.00000021094471901444105094063442866017) [ 6]
58553/              18638 = 3.14159244554136709947419250992595772078 ( 0.00000020804842613898845087335354516341) [ 6]
58908/              18751 = 3.14159244840275185323449416031145005599 ( 0.00000020518704138522814922296805282820) [ 6]
59263/              18864 = 3.14159245122985581000848176420695504664 ( 0.00000020235993742845416161907254783755) [ 6]
59618/              18977 = 3.14159245402329135269009854033830426305 ( 0.00000019956650188577254484294119862114) [ 6]
59973/              19090 = 3.14159245678365636458878994237820848611 ( 0.00000019680613687387385344090129439808) [ 6]
60328/              19203 = 3.14159245951153465604332656355777743060 ( 0.00000019407825858241931681972172545359) [ 6]
60683/              19316 = 3.14159246220749637606129633464485400704 ( 0.00000019138229686240134704863464887715) [ 6]
61038/              19429 = 3.14159246487209840959390601677904163878 ( 0.00000018871769482886873736650046124541) [ 6]
61393/              19542 = 3.14159246750588476102753044724183809231 ( 0.00000018608390847743511293603766479188) [ 6]
61748/              19655 = 3.14159247010938692444670567285677944543 ( 0.00000018348040631401593771042272343876) [ 6]
62103/              19768 = 3.14159247268312424119789558883043302306 ( 0.00000018090666899726474779444906986113) [ 6]
62458/              19881 = 3.14159247522760424525929279211307278305 ( 0.00000017836218899320335059116643010114) [ 6]
62813/              19994 = 3.14159247774332299689906972091627488246 ( 0.00000017584647024156357366236322800173) [ 6]
63168/              20107 = 3.14159248023076540508280698264286069528 ( 0.00000017335902783337983640063664218891) [ 6]
63523/              20220 = 3.14159248269040553907022749752720079129 ( 0.00000017089938769939241588575230209290) [ 6]
63878/              20333 = 3.14159248512270692962179707864063345300 ( 0.00000016846708630884084630463886943119) [ 6]
64233/              20446 = 3.14159248752812286021715739019857184779 ( 0.00000016606167037824548599308093103640) [ 6]
64588/              20559 = 3.14159248990709664866968237754754608687 ( 0.00000016368269658979296100573195679732) [ 6]
64943/              20672 = 3.14159249226006191950464396284829721362 ( 0.00000016132973131895799942043120567057) [ 6]
65298/              20785 = 3.14159249458744286745248977628097185470 ( 0.00000015900235037101015360699853102949) [ 6]
65653/              20898 = 3.14159249688965451239353048138577854340 ( 0.00000015670013872606911290189372434079) [ 6]
66008/              21011 = 3.14159249916710294607586502308314692304 ( 0.00000015442269029238677836019635596115) [ 6]
66363/              21124 = 3.14159250142018557091459950766900208293 ( 0.00000015216960766754804387561050080126) [ 6]
66718/              21237 = 3.14159250364929133116730234967274097094 ( 0.00000014994050190729534103360676191325) [ 6]
67073/              21350 = 3.14159250585480093676814988290398126463 ( 0.00000014773499230169449350037552161956) [ 6]
67428/              21463 = 3.14159250803708708009131994595350137445 ( 0.00000014555270615837132343732600150974) [ 6]
67783/              21576 = 3.14159251019651464590285502410085279940 ( 0.00000014339327859255978835917865008479) [ 6]
68138/              21689 = 3.14159251233344091474941214440499792521 ( 0.00000014125635232371323123887450495898) [ 6]
68493/              21802 = 3.14159251444821576002201632877717640583 ( 0.00000013914157747844062705450232647836) [ 6]
68848/              21915 = 3.14159251654118183892311202372804015514 ( 0.00000013704861139953953135955146272905) [ 6]
69203/              22028 = 3.14159251861267477755583802433266751407 ( 0.00000013497711846090680535894683537012) [ 6]
69558/              22141 = 3.14159252066302335034551284946479382141 ( 0.00000013292676988811713053381470906278) [ 6]
69913/              22254 = 3.14159252269254965399478745394086456367 ( 0.00000013089724358446785592933863832052) [ 6]
70268/              22367 = 3.14159252470156927616577994366700943354 ( 0.00000012888822396229686343961249345065) [ 6]
70623/              22480 = 3.14159252669039145907473309608540925266 ( 0.00000012689940177938791028719409363153) [ 6]
70978/              22593 = 3.14159252865931925817731155667684681095 ( 0.00000012493047398028533182660265607324) [ 6]
71333/              22706 = 3.14159253060864969611556416806130538183 ( 0.00000012298114354234707921521819750236) [ 6]
71688/              22819 = 3.14159253253867391209080152504491870809 ( 0.00000012105111932637184185823458417610) [ 6]
72043/              22932 = 3.14159253444967730682016396302110587824 ( 0.00000011914011593164247942025839700595) [ 6]
72398/              23045 = 3.14159253634193968322846604469516164026 ( 0.00000011724785355523417733858434124393) [ 6]
72753/              23158 = 3.14159253821573538302098626824423525347 ( 0.00000011537405785544165711503526763072) [ 6]
73108/              23271 = 3.14159254007133341927721198057668342572 ( 0.00000011351845981918543140270281945847) [ 6]
73463/              23384 = 3.14159254190899760520013684570646595963 ( 0.00000011168079563326250653757303692456) [ 6]
73818/              23497 = 3.14159254372898667915052985487509043707 ( 0.00000010986080655931211352840441244712) [ 6]
74173/              23610 = 3.14159254553155442609063955950868276154 ( 0.00000010805823881237200382377082012265) [ 6]
74528/              23723 = 3.14159254731694979555705433545504362854 ( 0.00000010627284344290558904782445925565) [ 6]
74883/              23836 = 3.14159254908541701627789897633831179728 ( 0.00000010450437622218474440694119108691) [ 6]
75238/              23949 = 3.14159255083719570754520021712806380224 ( 0.00000010275259753091744316615143908195) [ 6]
75593/              24062 = 3.14159255257252098744908985121768764026 ( 0.00000010101727225101355353206181524393) [ 6]
75948/              24175 = 3.14159255429162357807652533609100310237 ( 0.00000009929816966038611804718849978182) [ 7]
76303/              24288 = 3.14159255599472990777338603425559947299 ( 0.00000009759506333068925734902390341120) [ 7]
76658/              24401 = 3.14159255768206221056514077291914265808 ( 0.00000009590773102789750261036036022611) [ 7]
77013/              24514 = 3.14159255935383862282777188545321041037 ( 0.00000009423595461563487149782629247382) [ 7]
77368/              24627 = 3.14159256101027327729727534819507045113 ( 0.00000009257951996116536803508443243306) [ 7]
77723/              24740 = 3.14159256265157639450282942603071948261 ( 0.00000009093821684395981395724878340158) [ 7]
78078/              24853 = 3.14159256427795437170562909910272401722 ( 0.00000008931183886675701428417677886697) [ 7]
78433/              24966 = 3.14159256588960986942241448369782904750 ( 0.00000008770018336904022889958167383669) [ 7]
78788/              25079 = 3.14159256748674189560987280194585111049 ( 0.00000008610305134285277058133365177370) [ 7]
79143/              25192 = 3.14159256906954588758335979676087646872 ( 0.00000008452024735087928358651862641547) [ 7]
79498/              25305 = 3.14159257063821379174076269511954159257 ( 0.00000008295157944672188068815996129162) [ 7]
79853/              25418 = 3.14159257219293414115980801007160280116 ( 0.00000008139685909730283537320790008303) [ 7]
80208/              25531 = 3.14159257373389213113469899338059613802 ( 0.00000007985590110732794438989890674617) [ 7]
80563/              25644 = 3.14159257526126969271564498518171892060 ( 0.00000007832852354574699839809778396359) [ 7]
80918/              25757 = 3.14159257677524556431261404666692549598 ( 0.00000007681454767415002933661257738821) [ 7]
81273/              25870 = 3.14159257827599536142249710088906068805 ( 0.00000007531379787704014628239044219614) [ 7]
81628/              25983 = 3.14159257976369164453681253127044606088 ( 0.00000007382610159392583085200905682331) [ 7]
81983/              26096 = 3.14159258123850398528510116492949110974 ( 0.00000007235128925317754221835001177445) [ 7]
82338/              26209 = 3.14159258270059903086725933839520775306 ( 0.00000007088919420759538404488429513113) [ 7]
82693/              26322 = 3.14159258415014056682622900995365093837 ( 0.00000006943965267163641437332585194582) [ 7]
83048/              26435 = 3.14159258558728957821070550406657839984 ( 0.00000006800250366025193787921292448435) [ 7]
83403/              26548 = 3.14159258701220430917583245442217869519 ( 0.00000006657758892928681092885732418900) [ 7]
83758/              26661 = 3.14159258842504032106822699823712538914 ( 0.00000006516475291739441638504237749505) [ 7]
84113/              26774 = 3.14159258982595054904011354298946739374 ( 0.00000006376384268942252984029003549045) [ 7]
84468/              26887 = 3.14159259121508535723583888124372373265 ( 0.00000006237470788122680450203577915154) [ 7]
84823/              27000 = 3.14159259259259259259259259259259259259 ( 0.00000006099720064587005079068691029160) [ 7]
85178/              27113 = 3.14159259395861763729576218050381735698 ( 0.00000005963117560116688120277568552721) [ 7]
85533/              27226 = 3.14159259531330345992800999045030485565 ( 0.00000005827648977853463339282919802854) [ 7]
85888/              27339 = 3.14159259665679066534986649109330992355 ( 0.00000005693300257311277689218619296064) [ 7]
86243/              27452 = 3.14159259798921754334838991694594200786 ( 0.00000005560057569511425346633356087633) [ 7]
86598/              27565 = 3.14159259931072011608924360602212951206 ( 0.00000005427907312237339977725737337213) [ 7]
86953/              27678 = 3.14159260062143218440638774477924705542 ( 0.00000005296836105405625563850025582877) [ 7]
87308/              27791 = 3.14159260192148537296246986434457198373 ( 0.00000005166830786550017351893493090046) [ 7]
87663/              27904 = 3.14159260321100917431192660550458715596 ( 0.00000005037878406415071677777491572823) [ 7]
88018/              28017 = 3.14159260449013099189777635007316986115 ( 0.00000004909966224656486703320633302304) [ 7]
88373/              28130 = 3.14159260575897618201208674013508709562 ( 0.00000004783081705645055664314441578857) [ 7]
88728/              28243 = 3.14159260701766809474914138016499663633 ( 0.00000004657212514371350200311450624786) [ 7]
89083/              28356 = 3.14159260826632811397940471152489772887 ( 0.00000004532346512448323867175460515532) [ 7]
89438/              28469 = 3.14159260950507569637149179809617478660 ( 0.00000004408471754209115158518332809759) [ 7]
89793/              28582 = 3.14159261073402840948848925897417955356 ( 0.00000004285576482897415412430532333063) [ 7]
90148/              28695 = 3.14159261195330196898414357902073531974 ( 0.00000004163649126947849980425876756445) [ 7]
90503/              28808 = 3.14159261316301027492363232435434601499 ( 0.00000004042678296353901105892515686920) [ 7]
90858/              28921 = 3.14159261436326544725286124269561910030 ( 0.00000003922652779120978214058388378389) [ 7]
91213/              29034 = 3.14159261555417786043948474202658951574 ( 0.00000003803561537802315864125291336845) [ 7]
91568/              29147 = 3.14159261673585617730812776615089031461 ( 0.00000003685393706115451561712861256958) [ 7]
91923/              29260 = 3.14159261790840738209159261790840738209 ( 0.00000003568138585637105076537109550210) [ 7]
92278/              29373 = 3.14159261907193681271916385796479760324 ( 0.00000003451785642574347952531470528095) [ 7]
92633/              29486 = 3.14159262022654819236247710777996337244 ( 0.00000003336324504610016627549953951175) [ 7]
92988/              29599 = 3.14159262137234366025879252677455319436 ( 0.00000003221744957820385085650494968983) [ 7]
93343/              29712 = 3.14159262250942380183091007000538502961 ( 0.00000003108036943663173331327411785458) [ 7]
93698/              29825 = 3.14159262363788767812238055322715842414 ( 0.00000002995190556034026283005234446005) [ 7]
94053/              29938 = 3.14159262475783285456610328011223194602 ( 0.00000002883196038389654010316727093817) [ 7]
94408/              30051 = 3.14159262586935542910385677681275165551 ( 0.00000002772043780935878660646675122868) [ 7]
94763/              30164 = 3.14159262697255005967378331786235247314 ( 0.00000002661724317878886006541715041105) [ 7]
95118/              30277 = 3.14159262806750999108233972982792218515 ( 0.00000002552228324738030365345158069904) [ 7]
95473/              30390 = 3.14159262915432708127673576834485027969 ( 0.00000002443546615718590761493465260450) [ 7]
95828/              30503 = 3.14159263023309182703340655017539258433 ( 0.00000002335670141142923683310411029986) [ 7]
96183/              30616 = 3.14159263130389338907760648027175333159 ( 0.00000002228589984938503690300774955260) [ 7]
96538/              30729 = 3.14159263236681961664876826450584138761 ( 0.00000002122297362181387511877366149658) [ 7]
96893/              30842 = 3.14159263342195707152584138512418131119 ( 0.00000002016783616693680199815532157300) [ 7]
97248/              30955 = 3.14159263446939105152640930382813761912 ( 0.00000001912040218693623407945136526507) [ 7]
97603/              31068 = 3.14159263550920561349298313377108278614 ( 0.00000001808058762496966024950842009805) [ 7]
97958/              31181 = 3.14159263654148359577948109425611750745 ( 0.00000001704830964268316228902338537674) [ 7]
98313/              31294 = 3.14159263756630664025052725762126925289 ( 0.00000001602348659821211612565823363130) [ 7]
98668/              31407 = 3.14159263858375521380583946254019804502 ( 0.00000001500603802465680392073930483917) [ 7]
99023/              31520 = 3.14159263959390862944162436548223350253 ( 0.00000001399588460902101901779726938166) [ 7]
99378/              31633 = 3.14159264059684506686055701324566117661 ( 0.00000001299294817160208637003384170758) [ 7]
99733/              31746 = 3.14159264159264159264159264159264159264 ( 0.00000001199715164582105074168686129155) [ 7]
100088/              31859 = 3.14159264258137417998053925107504943657 ( 0.00000001100841905848210413220445344762) [ 7]
100443/              31972 = 3.14159264356311772801201050919554610283 ( 0.00000001002667551045063287408395678136) [ 7]
100798/              32085 = 3.14159264453794608072307932055477637525 ( 0.00000000905184715773956406272472650894) [ 8]
101153/              32198 = 3.14159264550593204546866264985402820050 ( 0.00000000808386119299398073342547468369) [ 8]
101508/              32311 = 3.14159264646714741109838754603695335953 ( 0.00000000712264582736425583724254952466) [ 8]
101863/              32424 = 3.14159264742166296570441648161855415741 ( 0.00000000616813027275822690166094872678) [ 8]
102218/              32537 = 3.14159264836954851399944678366167747487 ( 0.00000000522024472446319659961782540932) [ 8]
102573/              32650 = 3.14159264931087289433384379785604900459 ( 0.00000000427892034412879958542345387960) [ 8]
102928/              32763 = 3.14159265024570399536062021182431401275 ( 0.00000000334408924310202317145518887144) [ 8]
103283/              32876 = 3.14159265117410877235673439591191142474 ( 0.00000000241568446610590898736759145945) [ 8]
103638/              32989 = 3.14159265209615326320894843735790718118 ( 0.00000000149363997525369494592159570301) [ 8]
103993/              33102 = 3.14159265301190260407226149477372968400 ( 0.00000000057789063439038188850577320019) [ 9]
104348/              33215 = 3.14159265392142104470871594159265392142 (-0.00000000033162780624607255831315103723) [ 9]
208341/              66317 = 3.14159265346743670552045478534915632492 ( 0.00000000012235653294218859793034655927) [ 9]
312689/              99532 = 3.14159265361893662339750030141060161556 (-0.00000000002914338493485691813109873137) [10]
833719/             265381 = 3.14159265358107777120441930658185778183 ( 0.00000000000871546725822407669764510236) [11]
1146408/             364913 = 3.14159265359140397848254241421927966391 (-0.00000000000161074001989903093977677972) [11]
3126535/             995207 = 3.14159265358865040137378454934501063597 ( 0.00000000000114283708885883393449224822) [11]
4272943/            1360120 = 3.14159265358938917154368732170690821398 ( 0.00000000000040406691895606157259467021) [12]
5419351/            1725033 = 3.14159265358981538324194377730744861112 (-0.00000000000002214477930039402794572693) [13]
42208400/           13435351 = 3.14159265358977223594679439338801048070 ( 0.00000000000002100251584898989149240349) [13]
47627751/           15160384 = 3.14159265358977714548655231951908342163 ( 0.00000000000001609297609106376041946256) [13]
53047102/           16885417 = 3.14159265358978105189821489158366654492 ( 0.00000000000001218656442849169583633927) [13]
58466453/           18610450 = 3.14159265358978423412652568852445803298 ( 0.00000000000000900433611769475504485121) [14]
63885804/           20335483 = 3.14159265358978687646612573696921779531 ( 0.00000000000000636199651764631028508888) [14]
69305155/           22060516 = 3.14159265358978910556761228975786423128 ( 0.00000000000000413289503109352163865291) [14]
74724506/           23785549 = 3.14159265358979101134054126730478241221 ( 0.00000000000000222712210211597472047198) [14]
80143857/           25510582 = 3.14159265358979265937562694571217544154 ( 0.00000000000000057908701643756732744265) [15]
165707065/           52746197 = 3.14159265358979340254615892023457160333 (-0.00000000000000016408351553695506871914) [15]
245850922/           78256779 = 3.14159265358979316028327718420406748404 ( 0.00000000000000007817936619907543540015) [16]
411557987/          131002976 = 3.14159265358979325782644815641440084536 (-0.00000000000000001936380477313489796117) [16]
657408909/          209259755 = 3.14159265358979322134827119529027452029 ( 0.00000000000000001711437218798922836390) [16]
1068966896/          340262731 = 3.14159265358979323539256492948091926059 ( 0.00000000000000000307007845379858362360) [17]
2549491779/          811528438 = 3.14159265358979323901400975919959073572 (-0.00000000000000000055136637592008785153) [18]
3618458675/         1151791169 = 3.14159265358979323794416069185889026381 ( 0.00000000000000000051848269142061262038) [18]
6167950454/         1963319607 = 3.14159265358979323838637750639037956696 ( 0.00000000000000000007626587688912331723) [19]
14885392687/         4738167652 = 3.14159265358979323849387505801156062596 (-0.00000000000000000003123167473205774177) [19]
21053343141/         6701487259 = 3.14159265358979323846238174277486901359 ( 0.00000000000000000000026164050463387060) [21]
899125804609/       286200632530 = 3.14159265358979323846290312739302875642 (-0.00000000000000000000025974411352587223) [21]
920179147750/       292902119789 = 3.14159265358979323846289119831454290206 (-0.00000000000000000000024781503504001787) [21]
941232490891/       299603607048 = 3.14159265358979323846287980289163130623 (-0.00000000000000000000023641961212842204) [21]
962285834032/       306305094307 = 3.14159265358979323846286890609758924162 (-0.00000000000000000000022552281808635743) [21]
983339177173/       313006581566 = 3.14159265358979323846285847590540629124 (-0.00000000000000000000021509262590340705) [21]
1004392520314/       319708068825 = 3.14159265358979323846284848297337933163 (-0.00000000000000000000020509969387644744) [21]
1025445863455/       326409556084 = 3.14159265358979323846283890036945343710 (-0.00000000000000000000019551708995055291) [21]
1046499206596/       333111043343 = 3.14159265358979323846282970332883684002 (-0.00000000000000000000018632004933395583) [21]
1067552549737/       339812530602 = 3.14159265358979323846282086904029653302 (-0.00000000000000000000017748576079364883) [21]
1088605892878/       346514017861 = 3.14159265358979323846281237645725178231 (-0.00000000000000000000016899317774889812) [21]
1109659236019/       353215505120 = 3.14159265358979323846280420613037215131 (-0.00000000000000000000016082285086926712) [21]
1130712579160/       359916992379 = 3.14159265358979323846279634005887720665 (-0.00000000000000000000015295677937432246) [21]
1151765922301/       366618479638 = 3.14159265358979323846278876155814494589 (-0.00000000000000000000014537827864206170) [21]
1172819265442/       373319966897 = 3.14159265358979323846278145514157963557 (-0.00000000000000000000013807186207675138) [21]
1193872608583/       380021454156 = 3.14159265358979323846277440641497885695 (-0.00000000000000000000013102313547597276) [21]
1214925951724/       386722941415 = 3.14159265358979323846276760198188357586 (-0.00000000000000000000012421870238069167) [21]
1235979294865/       393424428674 = 3.14159265358979323846276102935860166316 (-0.00000000000000000000011764607909877897) [21]
1257032638006/       400125915933 = 3.14159265358979323846275467689777075912 (-0.00000000000000000000011129361826787493) [21]
1278085981147/       406827403192 = 3.14159265358979323846274853371947582775 (-0.00000000000000000000010515043997294356) [21]
1299139324288/       413528890451 = 3.14159265358979323846274258964906440289 (-0.00000000000000000000009920636956151870) [22]
1320192667429/       420230377710 = 3.14159265358979323846273683516091186105 (-0.00000000000000000000009345188140897686) [22]
1341246010570/       426931864969 = 3.14159265358979323846273126132748294414 (-0.00000000000000000000008787804798005995) [22]
1362299353711/       433633352228 = 3.14159265358979323846272585977311658438 (-0.00000000000000000000008247649361370019) [22]
1383352696852/       440334839487 = 3.14159265358979323846272062263203084137 (-0.00000000000000000000007723935252795718) [22]
1404406039993/       447036326746 = 3.14159265358979323846271554251010510785 (-0.00000000000000000000007215923060222366) [22]
1425459383134/       453737814005 = 3.14159265358979323846271061245004906498 (-0.00000000000000000000006722917054618079) [22]
1446512726275/       460439301264 = 3.14159265358979323846270582589961333896 (-0.00000000000000000000006244262011045477) [22]
1467566069416/       467140788523 = 3.14159265358979323846270117668253641040 (-0.00000000000000000000005779340303352621) [22]
1488619412557/       473842275782 = 3.14159265358979323846269665897195688730 (-0.00000000000000000000005327569245400311) [22]
1509672755698/       480543763041 = 3.14159265358979323846269226726605047424 (-0.00000000000000000000004888398654759005) [22]
1530726098839/       487245250300 = 3.14159265358979323846268799636567745111 (-0.00000000000000000000004461308617456692) [22]
1551779441980/       493946737559 = 3.14159265358979323846268384135384972222 (-0.00000000000000000000004045807434683803) [22]
1572832785121/       500648224818 = 3.14159265358979323846267979757684694309 (-0.00000000000000000000003641429734405890) [22]
1593886128262/       507349712077 = 3.14159265358979323846267586062682924856 (-0.00000000000000000000003247734732636437) [22]
1614939471403/       514051199336 = 3.14159265358979323846267202632581000779 (-0.00000000000000000000002864304630712360) [22]
1635992814544/       520752686595 = 3.14159265358979323846266829071086609244 (-0.00000000000000000000002490743136320825) [22]
1657046157685/       527454173854 = 3.14159265358979323846266465002047559662 (-0.00000000000000000000002126674097271243) [22]
1678099500826/       534155661113 = 3.14159265358979323846266110068188399415 (-0.00000000000000000000001771740238110996) [22]
1699152843967/       540857148372 = 3.14159265358979323846265763929940953314 (-0.00000000000000000000001425601990664895) [22]
1720206187108/       547558635631 = 3.14159265358979323846265426264360740155 (-0.00000000000000000000001087936410451736) [22]
1741259530249/       554260122890 = 3.14159265358979323846265096764121998082 (-0.00000000000000000000000758436171709663) [23]
1762312873390/       560961610149 = 3.14159265358979323846264775136584745085 (-0.00000000000000000000000436808634456666) [23]
1783366216531/       567663097408 = 3.14159265358979323846264461102927921823 (-0.00000000000000000000000122774977633404) [23]
3587785776203/      1142027682075 = 3.14159265358979323846264306850252143875 ( 0.00000000000000000000000031477698144544) [24]
5371151992734/      1709690779483 = 3.14159265358979323846264358066268961876 (-0.00000000000000000000000019738318673457) [24]
8958937768937/      2851718461558 = 3.14159265358979323846264337555790890412 ( 0.00000000000000000000000000772159398007) [26]
77042654144230/     24523438471947 = 3.14159265358979323846264338985711710108 (-0.00000000000000000000000000657761421689) [26]
86001591913167/     27375156933505 = 3.14159265358979323846264338836754332073 (-0.00000000000000000000000000508804043654) [26]
94960529682104/     30226875395063 = 3.14159265358979323846264338715903365924 (-0.00000000000000000000000000387953077505) [26]
103919467451041/     33078593856621 = 3.14159265358979323846264338615889617509 (-0.00000000000000000000000000287939329090) [26]
112878405219978/     35930312318179 = 3.14159265358979323846264338531751659443 (-0.00000000000000000000000000203801371024) [26]
121837342988915/     38782030779737 = 3.14159265358979323846264338459987358119 (-0.00000000000000000000000000132037069700) [26]
130796280757852/     41633749241295 = 3.14159265358979323846264338398054099481 (-0.00000000000000000000000000070103811062) [27]
139755218526789/     44485467702853 = 3.14159265358979323846264338344061241435 (-0.00000000000000000000000000016110953016) [27]
288469374822515/     91822653867264 = 3.14159265358979323846264338319580081089 ( 0.00000000000000000000000000008370207330) [28]
428224593349304/    136308121570117 = 3.14159265358979323846264338327569743446 ( 0.00000000000000000000000000000380544973) [29]
3137327371971917/    998642318693672 = 3.14159265358979323846264338328304372840 (-0.00000000000000000000000000000354084421) [29]
3565551965321221/   1134950440263789 = 3.14159265358979323846264338328216143479 (-0.00000000000000000000000000000265855060) [29]
3993776558670525/   1271258561833906 = 3.14159265358979323846264338328146834546 (-0.00000000000000000000000000000196546127) [29]
4422001152019829/   1407566683404023 = 3.14159265358979323846264338328090949305 (-0.00000000000000000000000000000140660886) [29]
4850225745369133/   1543874804974140 = 3.14159265358979323846264338328044932237 (-0.00000000000000000000000000000094643818) [30]
5278450338718437/   1680182926544257 = 3.14159265358979323846264338328006381618 (-0.00000000000000000000000000000056093199) [30]
5706674932067741/   1816491048114374 = 3.14159265358979323846264338327973616618 (-0.00000000000000000000000000000023328199) [30]
6134899525417045/   1952799169684491 = 3.14159265358979323846264338327945425704 ( 0.00000000000000000000000000000004862715) [31]
17976473982901831/   5722089387483356 = 3.14159265358979323846264338327954374978 (-0.00000000000000000000000000000004086559) [31]
24111373508318876/   7674888557167847 = 3.14159265358979323846264338327952097924 (-0.00000000000000000000000000000001809505) [31]
30246273033735921/   9627687726852338 = 3.14159265358979323846264338327950744587 (-0.00000000000000000000000000000000456168) [32]
36381172559152966/  11580486896536829 = 3.14159265358979323846264338327949847672 ( 0.00000000000000000000000000000000440747) [32]
66627445592888887/  21208174623389167 = 3.14159265358979323846264338327950254837 ( 0.00000000000000000000000000000000033582) [33]
230128609812402582/  73252211597019839 = 3.14159265358979323846264338327950319206 (-0.00000000000000000000000000000000030787) [33]
296756055405291469/  94460386220409006 = 3.14159265358979323846264338327950304754 (-0.00000000000000000000000000000000016335) [33]
363383500998180356/ 115668560843798173 = 3.14159265358979323846264338327950295601 (-0.00000000000000000000000000000000007182) [34]
430010946591069243/ 136876735467187340 = 3.14159265358979323846264338327950289285 (-0.00000000000000000000000000000000000866) [35]
1356660285366096616/ 431838381024951187 = 3.14159265358979323846264338327950287593 ( 0.00000000000000000000000000000000000826) [35]
1786671231957165859/ 568715116492138527 = 3.14159265358979323846264338327950288000 ( 0.00000000000000000000000000000000000419) [35]
2216682178548235102/ 705591851959325867 = 3.14159265358979323846264338327950288250 ( 0.00000000000000000000000000000000000169) [35]
2646693125139304345/ 842468587426513207 = 3.14159265358979323846264338327950288418 ( 0.00000000000000000000000000000000000001) [37]
```

https://polska.pl/science/famous-scientists/abraham-stern-17691842-does-anyone-have-calculator/

Portrait of Abraham Stern (with one of his calculating machines) from 1823, artist Jan Antoni Blank
The problem with the world is that the intelligent people are full of doubts, while the stupid ones are full of confidence.
Charles Bukowski

### Abraham Jakub Stern

The Polish Jew Abraham Jakub Stern (see biography of Abraham Stern), a mathematician, inventor, translator, and censor, was born in 1768 in Hrubieszów to a poor Jewish family. Around 1800, while working at a clockmaker’s shop in his home town, he had the good fortune to be noticed by Stanisław Wawrzyniec Staszic (1755-1826), a leading figure of the Polish Enlightenment: a Catholic priest, statesman, philosopher, geologist, writer, and translator. Staszic, who had attended secondary school in Hrubieszów and Lublin in the early 1770s, bought an estate in Hrubieszów in 1800. He recognized the extraordinary talent of the humble clockmaker and encouraged him to devote himself to the study of mathematics, Latin, and German, later sending him to Warsaw to continue his studies.

Stern designed his first computing machine around 1810, and in 1811 he sent a report to Staszic, outlining the device and asking for financial help. He later designed two more calculating devices. His inventions became popular at the time of their development. In 1816 and 1818, Stern demonstrated his machine to Tsar Alexander I of Russia, who received him cordially and granted him an annual pension of 350 roubles, promising that, in the event of Stern’s death, half of this sum would be paid to his widow.

For his inventions, Stern was admitted to the Warsaw Society of the Friends of Science (Warszawskiego Towarzystwa Przyjaciół Nauk, the predecessor of the Polish Academy of Sciences), first as a corresponding member (1817), then as a qualifying member (1821), and finally as a full member (1830). He presented his inventions multiple times at the Society’s meetings.

Stern was the father-in-law of another inventor of calculating machines, Chaim Zelig Slonimski, and heavily influenced him. Stern most probably also had a strong influence on another Polish Jew and inventor, Izrael Abraham Staffel.

Stern presented three calculating machines to the Society. The first, a machine for the four arithmetic operations, was designed around 1810 and presented on 7 January 1813. The second, a different machine for extracting square roots, was presented on 13 January 1817. The third, a combined machine for the four operations and square roots, was presented on 30 April 1817. Stern himself gave a lengthy description of his machines, which you can see below. Unfortunately, none of his original machines has survived to the present day, only a later replica of one of them, shown below. There is also a low-quality reproduction of a picture of Stern with one of his calculating machines (see below).

Abraham Jakub Stern with his calculating machine (upper image)
and a later replica of the machine itself (lower photo) (© Science Museum, London)

The only detailed description of Stern’s machines is his treatise, prepared for the presentation of the combined machine for four operations and square roots to the Warsaw Scientific Society on 30 April 1817, which you can see below (translated from Polish by Phil Boiarski and Janusz Zalewski):

***
Treatise on an Arithmetic Machine

For the third time, in this earnest place of gatherings of the Society, comprising a selection of learned and enlightened men, I reveal the fruits of my thought—first in the month of January 1813, I presented an invention of a Machine for 4 arithmetic operations—secondly, in January of the current year 1817, an invention of the Machine for extracting roots with fractions—and then finally today, the 30th of April of the current year 1817, an invention combining both these Machines into a single one.

This memorable day is the anniversary of establishing the glorious Warsaw Society of the Friends of Science and honoring it with the title of a Royal Society. I consider myself to be extremely fortunate that on this celebrated day, I can report concerning my inventions, regarding both their historical development and the thought that inspired them as well as regarding properties of said Machines.

Remarks that initially led me to this thought are the following:

A man, although he comes into the world without any means to meet his inevitable needs, nevertheless, being above all creatures with his invaluable gift of mind elevated, with his unlimited ingenuity, uncountable for meeting his stressing needs finds means; because every need he feels inspires him to search and devise means corresponding to this need. That Man, feeling his superiority, thinks that all of nature for his benefit and service has been created and to him subdued; as the Psalmist with astonishment about a man says:

For thou hast made him a little lower than the angels, and hast crowned him with glory and honor. Thou madest him to have dominion over the works of thy hands; thou hast put all things under his feet: All sheep and oxen, yea, and the beasts of the field; The fowl of the air, and the fish of the sea, etc.

This feeling of his undetermined power over all of nature, causes him to regard whatever he finds for himself in nature useful for his needs, even though it should be considered more as a luxury. Experience teaches us that many things that initially were luxurious only because of having been used by a small number of people, with time, however, have become so common that they have shifted from the level intrinsic to luxury to the level of essential need.

From all this ensues, that when in the mankind a number of needs increases, then by this very thing, the ingenuity in methods and means to meet these needs has to multiply. Since such means are commonly based on physical acts, that is, works of the body, which often become onerous, or even beyond human power, so in such case the mind, as a primary Leader of the Man, makes every effort to invent intermediary means to replace the work of the body, or at least to ease it. Following this purpose, numerous mechanical tools have been invented to protect human physical power and support it.

From the depths of this convincing truth, another one equally undeniable have I drawn, that while no one spared efforts to bring assistance and relief to the physical condition, it becomes at least equally necessary to launch a search for mechanical means, which would offer help in human mental activities and relieve the intensity of thought; since the intensity of thought, as it is known, not only often impairs the subtlety of organs, deadens the wit, degrades memory, but also, even causes weakening of the body.

I considered arithmetic or calculation science such a mental activity, one that was necessary, but through an intensity of thought, one that could be harmful. In it, the first 4 types of operations, that is, Addition, Subtraction, Multiplication and Division are the main principles of all calculations, insofar that all other calculi are only the result of combinations of the said 4 kinds. And even though all 4 arithmetic operations, in general, require an uninterrupted presence of mind, such that if it gets distracted for a moment a calculation cannot be accurate, Multiplication and Division, for the reason of higher and more continuous intensity of thought, turn out to be the most difficult ones and are therefore so often subject to errors.

Abraham Stern demonstrating one of his calculating machines in Warsaw (at public sittings of the Friends of Sciences Society, Stern’s Jewish clothes among black tailcoats worn by his colleagues always puzzled people who were not aware who he was)

At this point, I think it would not be unusual, regarding calculation errors, to make the following remark:

In regular calculus, we do not have and even cannot have a test convincing us whether any error was made; this is because a test performed by a reverse calculation, for example, Multiplication by Division, or Division by Multiplication, does not yet constitute a sufficient proof, for it means to test a mental activity by another mental activity. Since the repeated mental action, intending to serve as a test, is subjected to an equal error as in the primary calculation, this error in a test could have obstructed an error made in the calculation itself, and made it invisible.

All these remarks became for me the reason to invent an arithmetic Machine based on mechanical and arithmetic principles, with the assistance of which even people knowing only counting and numbers, all 4 kinds of calculations, and therefore all the other calculi, without slightest application of thought to it, easily could accomplish. And because I thought it to be just, in such an important subject, not to rely solely on the principles of the theory of the mechanism, on which my invention has been based, insofar the slightest error in these principles could have disproved the entire construction of the invention, therefore for better conviction, I elaborated for testing a model of such an arithmetic Machine that worked. And even though the Machine was not of a durable construction, and the required accuracy in the first, rough design, could not have been achieved, however, it exactly performed all the arithmetic calculations, so far that it proved the reality of this important invention.

In the month of December 1812, I submitted this invention for the consideration of the respectable Royal Warsaw Society of the Friends of Science. This Eminent Society, having assessed the invention as corresponding fully to its purpose, deigned to deliver to the public a message about it, at its gathering in January 1813.

I have stated then, that I have planned to make another Machine, made of metal, in a way strong and durable. And although such an endeavor in particular at the initial stage, required time and significant funds for covering expenses, which by then a critical war situation of the Polish state, of which I am a compatriot, made it even more difficult for me, however, not saving efforts on my part, this statement of mine I have put into effect, so far that working continuously on this invention, I have finished a Machine for 4 fundamental arithmetic operations, completely of metal made with the finest work, and performing 13 digit operations.

In conjunction with work on this arithmetic Machine, I also worked on another, by far more difficult, invention of a Machine for extracting roots with fractions. The difference existing between only arithmetic operations and extracting roots already shows the level of difficulty; because in the former, there are always at least two known numbers given and the third unknown is searched for, but in extracting roots, there is only one known number given, and the other one unknown, that multiplied by itself equals the given number. Admittedly I learned that I ventured into such an abyss, from which a recovery is subject to numerous difficulties, both regarding the implementation of a thought as well as the huge costs, which a carefully elaborated plan definitely required. But no difficulty could oppose my keen willingness to finish the invention, which from various points of view seems to be important, both for the intention proper, the relief in the intensity of thought and counteracting unintentional errors, and to create a completely new mechanical means, included in this invention, which could apply great benefits to mechanical tools in other objects.

Thanks to the Almighty, I have passed this difficult and dangerous path, too, and the Machine for extracting roots with fractions I have led to the intended goal. This invention just like the first one, I have submitted for consideration of the glorious Royal Society of the Friends of Science, about which the public has been informed at the past January gathering of this Society.

This way, then, these two inventions, two separate Machines have been formed, one for the 4 arithmetic operations, and another for extracting roots.

I began thinking further on the ways, which these two inventions could combine into a single Machine. It seemed to me, initially, to be impossible, indeed. But finally, at this point mechanical ingenuity showed me the means to put my intent into effect. The importance of this thought so overwhelmed me, that to all unpleasant things stemming from shortages I have been insensitive, doubling my efforts, so I could make this combination sooner.

Thanks to the Highest Being, in this subject I did not fail either. I can say this boldly, since I am referring to the convincing proof, that is, before our eyes: a Machine, which accurately performs all 4 arithmetic operations, as well as extracting roots.

If I wanted to venture into the details and explanations of all principles of the internal Mechanism of this Machine, the purpose would be missed. Because the Mechanism, comprising several wheels of various kind, rotations of a new type, springs and levers, by various means connected with each other, requires an extensive description and many figures, which will be the subject of a work planned for a later date, with figures clearly presenting the matter, but in this treatise clarifications would be an excessive boring of the respected public. Therefore I am moving now only to a brief sketch of the Machine and the explanation of the way of using it, in various arithmetic operations and extracting roots, as well as doing a foolproof test.

This Machine has a shape of a parallelepiped, longish and rectangular, in its length by five rows of wheels divided. The first, uppermost row, just like the second one underneath, is composed of 13 wheels based on axles. The wheels of the first uppermost row have discs, on which there are engraved ordinary digits of numbers of which only one number over the aperture is visible. Because the numbers of these wheels correspond to positions of units, tens, hundreds, etc., thus, this row entails trillions. Wheels of the second row, in turn, do not have discs, and serve only as the Mechanism offering movement to the uppermost numerical wheels. Both these rows do not change their place in the Machine. Behind these rows underneath, there are two rows of wheels, which similarly have numerical discs visible through the apertures, and are placed in a separate base in the shape of a carriage. This carriage with its two rows of wheels is so imbedded in the Machine, that it can easily move on smaller wheels or rollers. The first row in this carriage has 7 numerical wheels on the axles of which there are as many folding cranks. Besides these cranks, on the diameter of a folding crank, there is another main crank which can be inserted and removed. The second row underneath has 8 wheels. Below this carriage, there is a lowermost row, composed of 7 wheels, equipped with numerical discs visible through apertures. This row has a stable and invariable place in the Machine. In addition to these rows of wheels, on top of the Machine there are two more rows of wheels, on which Roman numerals are engraved, visible through apertures. One of these rows has its place above the ordinary numerical apertures of the uppermost row, and the other above the ordinary numerical apertures of the lowermost row.

The way of using the Machine in operations is the following:
When any of the 4 arithmetic operations is to be performed, then one has to move to the left the handle positioned on the right on the carriage at the second row. As a result of this move, on a carriage on the left hand side, the word Species shows through an aperture, all the numerical apertures of the second row are covered, and thus the Machine is ready for 4 arithmetic operations. If the species of an operation is to be addition or multiplication, then with two handles at the ends of the Machine devised at the right and left hand sides, one moves up, while at the same time the words: Addition – Multiplication show up on the Machine through apertures, and by this the Machine is ready for these operations. If the species of the calculation were subtraction or division, this is done by moving from top down, the same way, words Addition and Multiplication disappear, and in their place the words Subtraction – Division are seen and the Machine by this is ready for the said operations.

In calculations of addition or subtraction, one puts the first number known participating in the problem, in the uppermost row, and the other one in the first crank row on a carriage. The operation is performed by the main crank in the middle of the carriage base, which gives movement to the entire Machine. If only a single circular rotation is performed, then the brake, located on the left hand side of the carriage, stops further movement of the Machine, and at the same time the unknown number searched for, appears in the uppermost row through apertures as a result of the operation.

In addition, there is one more convenience in the Machine, that is, because in this type of operation it happens that more than two rows of numbers in the calculation have to be combined, for example, in Registers and Tables, one sets on the Machine the first two given rows, as mentioned above, and by making a single circular rotation of a crank the Machine brakes, one touches the brake with a finger, and the Machine becomes available for rotation. Furthermore, one sets, in the third crank row on a carriage, the third given row, and rotates the crank once again, and so on, acting so until all the given rows are exhausted; at that time, in the uppermost row there will be a general Sum of all the given numbers. However, to prevent an error from squeezing in, when all these different numbers are being added, which can especially happen when the operator interrupts the work, the Machine shows, through the aperture, the number corresponding to the value of how many given rows have been taken to the operation thus far.

Multiplication is performed in the following way. One of the factors is set on the crank row in a carriage, and the other on the lowermost row, while on the uppermost row, which is designated for the product searched for, zeroes are placed. After that, one moves the carriage from the right to the left side, to the very end of the Machine, by the handle placed on the left hand side of the carriage. After releasing the handle, the carriage returns by itself, and stops in a position resulting from the nature of the problem. In this position one begins rotation of the main crank. During the rotation, the carriage moves by itself from one number to the other towards the right hand side, back through the end of the Machine; over there, the operation lasts until the ring of a bell warns about the operation’s completion, while at the same time the product searched for appears already on the uppermost row. In this species of operation, the Machine has a particular superiority over calculations in an ordinary manner, that from several given multiplications one can obtain a general product without performing an addition operation, that is, without combining individually calculated products together. This is because in an ordinary calculation, in such case, one has to first calculate a separate product from every two factors, then collect all individual products together and, by addition, derive the general product. On the Machine, however, one sets the first task and operates as long as the ring of a bell indicates to stop; not paying attention to the value of a product, one sets the second task, third, and so on, and when after the last operation the ring of a bell indicates to stop the rotations, at that time the general product of all the tasks appears on the uppermost row.
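The procedure described above reads like multiplication by repeated addition with a shifting carriage. The following Python sketch is only a modern illustration under that assumption, not a reconstruction of Stern's mechanism: the function name, the `accumulator` parameter (standing in for the uppermost row), and the count of crank turns are all hypothetical, and the inputs are assumed to be non-negative integers.

```python
def multiply_by_repeated_addition(multiplicand, multiplier, accumulator=0):
    """Shift-and-add multiplication, loosely modelled on the description.

    multiplicand: the factor set on the crank row of the carriage
    multiplier:   the factor set on the lowermost row
    accumulator:  the uppermost (product) row; zero for a fresh product,
                  or a previous result when summing several products
    Returns (accumulator, number_of_crank_turns).
    """
    turns = 0
    digits = [int(d) for d in str(multiplier)]  # most significant digit first
    # The carriage starts at the leftmost position and steps to the right.
    for position, digit in zip(range(len(digits) - 1, -1, -1), digits):
        shifted = multiplicand * 10 ** position  # carriage offset
        for _ in range(digit):  # one crank turn adds the shifted factor once
            accumulator += shifted
            turns += 1
    return accumulator, turns


# A single product, then a "general product" of two tasks without
# an intermediate addition step (the accumulator is simply not cleared).
product, turns = multiply_by_repeated_addition(123, 45)
total, _ = multiply_by_repeated_addition(12, 12, accumulator=product)
```

Carrying the accumulator from one call into the next mirrors the remark that several multiplications can be run in sequence while the uppermost row collects their sum.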

In division, one proceeds in the following way. The dividend is set on the uppermost row, and the divisor on the crank row in the carriage, while on the lowermost row, designated for the quotient, zeroes are placed. The carriage moves towards the left hand side, until the divisor stands straight under the dividend number being greater or at least equal to the divisor. Then a main crank rotation begins and lasts as long as the dividend number becomes smaller than the divisor, at which point one presses with finger a flap situated on the right hand side of the carriage, after which the carriage moves by itself towards the right hand side and stops at the appropriate place, where further operation continues in a similar manner till the end of the job. And when the divisor located on the carriage, standing in its first place, that is, at the end of the Machine on the right hand side, carries the dividend, then the operation is to stop and the quotient appears on the lowermost row. In case there is a fraction, then the numerator appears on the uppermost row and the denominator on a crank row in the carriage. If on the uppermost row there are only zeroes, this means that quotient is a whole number, without a fraction.
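The division procedure can be read as repeated subtraction of the divisor at successive carriage positions. The sketch below illustrates that reading in Python; it is an assumption about the arithmetic, not a model of the actual wheels and flaps, and the function name is hypothetical. It returns the quotient (lowermost row) and the remainder (what is left on the uppermost row), so the fractional part is the remainder over the divisor.

```python
def divide_by_repeated_subtraction(dividend, divisor):
    """Shift-and-subtract division on non-negative integers.

    Returns (quotient, remainder); remainder/divisor is the fraction,
    and a remainder of zero means the quotient is a whole number.
    """
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    remainder = dividend
    quotient = 0
    # Move the carriage left until the shifted divisor would no longer fit.
    position = 0
    while divisor * 10 ** (position + 1) <= remainder:
        position += 1
    # At each position, subtract until the remainder is smaller than the
    # shifted divisor, then step the carriage one place to the right.
    for p in range(position, -1, -1):
        shifted = divisor * 10 ** p
        digit = 0
        while remainder >= shifted:  # one crank turn per subtraction
            remainder -= shifted
            digit += 1
        quotient = quotient * 10 + digit
    return quotient, remainder


quotient, remainder = divide_by_repeated_subtraction(1000, 7)
```

For 1000 divided by 7 this yields a quotient of 142 with 6 left over, that is, 142 and 6/7.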

I am now going to describe the way of extracting roots.

If one wants to extract a square root from a given number, first, one has to move to the right the handle at the second row on the right hand side of the carriage, so that at the left hand side of the carriage, the word Species disappears and is replaced by the word Radices in the aperture. Then, numerical apertures of the second row of the carriage open, and the Machine is ready for extracting roots. Next, one has to move the handles at the ends of the Machine from top down, where between the inscriptions Subtraction – Division, one can also see on the Machine the word Extraction. The main crank in the middle of the carriage has to be removed, and the smaller folding cranks replace it in the operation. On the uppermost row, one sets up the known number of a given square, and on the first and second rows of the carriage all zeroes, except at the position of units in the second row, where one places the number 1. At the apertures for ordinary numbers of the uppermost row there are various signs dividing this row into sections, in such an order that for every two numerical wheels there is a sign, that is, at units, hundreds, tens of thousands, millions, and so on. On the said cranks there are identical signs, so that each crank corresponds to two wheels of the uppermost row, for example, the first crank from the right corresponds to units and tens, the second one – hundreds and thousands, and so on. The last sign, at the given number of the square, points to the crank from which the operation has to start, for example, if the given square ends on the wheels of the first sign, then the operation has to be undertaken with the first crank on the right hand side. If, however, a given square ended on the wheels of the second sign, the operation then begins with the second crank, having the same sign. Indicated this way the folding crank unfolds, the carriage moves to the left until the unfolding crank stops in front of the last sign of a given square. 
The rotation is conducted with this unfolding crank and lasts as long as the number on the uppermost row, in front of the rotating crank, becomes smaller than or, at least, equal to the number positioned in front of the same crank on the second row of the carriage. Next, by folding this crank, the crank to the right of it unfolds, and by pressing with finger a flap on the right hand side of the carriage, the carriage moves by itself to the right hand side, until it is stopped by a folded crank, just in front of the previous section. One performs the same operations as above, for each section up to the last one. After completing the operation, if a given number was a full square, then it is replaced by zeroes and the whole root on the crank row in the carriage appears. Otherwise, except of the whole number root, an additional fraction results, namely, the numerator on the uppermost row and the denominator on the second row in the carriage.

To approximate the root in decimal fractions, one has to set on the uppermost row as many sections to zeroes as decimal digits in a fraction we want to have, for example, if the root is to be extracted from the number 7, and a fraction approximated with two decimal digits, one sets two sections, or four wheels, to zeroes, and the given number 7 is set in the third section, that is, on the 5th wheel of the uppermost row. To distinguish between the number actually given and the zeroes attached to it, a moving hand always slides out under this sign where the actual number has been set, which warns the operator how many digits for a decimal fraction he has to cut from the right hand side on the crank row in the carriage. This way, then, when a given number is under the third sign, one has to unfold the third crank and perform the operations as above. The root will, then, result on the 3 crank wheels in the carriage, as number 264. Cutting off, following the hand’s warning, two digits for a decimal fraction, will mean 2 wholes and 64 hundredths. In addition to that, in the uppermost row there is number 304, as a numerator, and in the second row of the carriage, 529, as a denominator of the ordinary fraction of the tenth units of the first order.
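The numbers in this example (root 264, numerator 304, denominator 529 = 2 × 264 + 1) are consistent with the classical method of extracting a square root by subtracting successive odd numbers, one two-digit section at a time. The treatise does not name the method, so the Python sketch below is an assumption that happens to reproduce the stated figures; the function name and parameters are hypothetical.

```python
def sqrt_by_odd_subtraction(n, decimal_digits=0):
    """Digit-by-digit square root using subtraction of odd numbers.

    n:              non-negative integer whose root is wanted
    decimal_digits: number of decimal places; each one appends a
                    two-digit section of zeroes to n
    Returns (root, remainder, next_odd), where root still contains the
    decimal digits (cut them off from the right), and remainder/next_odd
    corresponds to the ordinary fraction left on the machine.
    """
    if n < 0 or decimal_digits < 0:
        raise ValueError("expects non-negative arguments")
    remainder = n * 100 ** decimal_digits
    # Count the two-digit sections occupied by the number.
    sections = 1
    while 100 ** sections <= remainder:
        sections += 1
    root = 0
    for s in range(sections - 1, -1, -1):  # leftmost section first
        scale = 100 ** s
        root *= 10
        odd = 2 * root + 1  # starts at 1 in the first section
        while remainder >= odd * scale:  # one crank turn per subtraction
            remainder -= odd * scale
            odd += 2
            root += 1
    return root, remainder, 2 * root + 1


root, numerator, denominator = sqrt_by_odd_subtraction(7, decimal_digits=2)
```

With the number 7 and two decimal digits, the sketch gives 264 (read as 2.64), with 304 and 529 as the numerator and denominator, matching the values quoted in the text.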

At the beginning of this treatise I explained that in our ordinary calculus, there is no convincing test that our mental operation was free of error, and that testing by reverse calculation does not constitute a sufficient proof. The same remark applies to the operation of the Machine. In case the Machine, due to damage, produced a false result, then the test by a reverse operation would prove nothing, because the same damage that caused a false result in Multiplication, for example, would have produced a false result in Division, which would be clear even from the composition of the Machine. But to remedy this, I devised for the Machine a completely different kind of test, which is an absolute proof.

Two rows of wheels with Roman numerals, mentioned above, located on top of the Machine, are designed for this purpose.

To obtain such a reliable test, one proceeds as follows.

Regarding Multiplication: Since of all the factors, the numbers of the factor on the uppermost row disappear during the work, being replaced by zeroes, to make it visible after the work what factor was a part of the problem, one sets in advance, on a Roman numbered row located above the apertures of the lowermost row, the signs corresponding to the digits of the factor which is to disappear. And because after completing the operation, the product results on the uppermost row and on the lowermost row there are zeroes, one shifts as many zeroes to the number 9 as the number of digits of the remaining factor in the Roman numbered row, except for the first meaningful digit, at the right hand side of the factor, where the appearing zero remains. After this, the operation of testing begins. The carriage moves to the left and will stop by itself at the last number 9, but the rotation lasts until the number appears that is equal to the Roman numeral right above it, that is, the same which previously disappeared. At that time, one presses the flap on the right hand side of the carriage, the carriage moves to the right and the rotations proceed further, as before, until the given factor fully appears in its first place, that is, on the lowermost row.

If after this work it turns out that there are as many digits in the factor on the lowermost row, as the number of zeroes on the uppermost row, at the right hand side, and the numbers following them are equal to the numbers on the crank row of the carriage, then it is an absolute proof that the first product was true, otherwise it had been false.

In Division, the test is conducted as follows. When the dividend is set on the uppermost row, at the same time on the Roman numerals row above it, the same signs are set, so that the dividend, which disappears during the work, is preserved that way; and when after completion of the division, the quotient appears on the lowermost row, it is then moved to the Roman numerals row directly above it. Then one moves the carriage towards the left hand side, until the first number of units on the carriage appears in front of the last number of the quotient. The rotation is conducted in this place, until the carriage moves by itself towards the right hand side, and this continues number by number, until the carriage moves to the first number of units, where one has to rotate until a zero appears. After completing this operation, one moves the carriage to the left until its first number of units passes all the digits of the preserved quotient. When the carriage is moved thus far, one has to watch whether the number on the uppermost row, which has now formed, combined with the number on the crank row of the carriage, matches the dividend preserved on the Roman numerals row, which is a positive proof that the first result was true; otherwise it had been false.

The test of extracting roots is conducted the same way as in the division, except that right before making the test, one has to adjust the Machine from the state of extracting roots to the state of arithmetic operations, and then set on the lowermost row the number equal to the root that resulted in the crank row of the carriage. The remaining steps and the proving test are the same as in the division testing.

Remark: I am ending this treatise with a remark that since Mechanics is an Opener to [meeting] our needs, insofar as it can replace not only our physical power but even our mental power, we should put our strongest effort into propagating ingenuity in such a broad and useful field, not venturing, however, into a search for perpetuum mobile, that is, an eternal motion, since this is an incurable disease of Mechanics, just as a philosopher’s stone and an inextinguishable fire are in Chemistry, and a squaring of a circle is in Geometry; all thoughts in this area have an attribute of an ineffective stubbornness. Let us better strive to make progress in mechanical matters showing promise, since such conduct paves the way to the well-being and glory of the Nation.

https://www.computerhistory.org/atchm/who-invented-the-microprocessor/

# Who Invented the Microprocessor?

Sep 20, 2018

Intel Pentium microprocessor die and wafer

The microprocessor is hailed as one of the most significant engineering milestones of all time. The lack of a generally agreed definition of the term has supported many claims to be the inventor of the microprocessor. The original use of the word “microprocessor” described a computer that employed a microprogrammed architecture—a technique first described by Maurice Wilkes in 1951. Viatron Computer Systems used the term microprocessor to describe its compact System 21 machines for small business use in 1968. Its modern usage is an abbreviation of micro-processing unit (MPU), the silicon device in a computer that performs all the essential logical operations of a computing system. Popular media stories call it a “computer-on-a-chip.” This article describes a chronology of early approaches to integrating the primary building blocks of a computer on to fewer and fewer microelectronic chips, culminating in the concept of the microprocessor.

https://www.dos4ever.com/flyback/flyback.html

# Flyback Converters for Dummies

## Ronald Dekker

### Special thanks to Frans Schoofs, who really understands how flyback converters work

 If you are interested in Flyback Converters you might want to keep track of my present project: the µTracer: a miniature radio-tube curve-tracer

## introduction

In the NIXIE clocks that I have built, I did not want to have the big and ugly mains transformer in the actual clock itself. Instead I use an AC adapter that fits into the mains wall plug. This means that I have to use some sort of an up-converter to generate the 180V anode supply for the NIXIEs.

This page describes a simple boost converter and a more efficient flyback converter both of which can be used as a high voltage power supply for a 6 NIXIE tube display. Frans Schoofs beautifully explained to me the working of the flyback converter and much of what he explained to me you find reflected on this page. I additionally explain the essentials of inductors and transformers that you need to know. This is just a practical guide to get you going, it is not a scientific treatise on the topic.

## What you need to know about inductors

Consider the simple circuit consisting of a battery connected to an inductor with inductance L and resistance R (Fig. 1). When the battery is connected to the inductor, the current does not immediately change from zero to its maximum value V/R. The law of electromagnetic induction, Faraday’s law, prevents this. What happens instead is the following. As the current increases with time, the magnetic flux through the loop, which is proportional to the current, increases as well. The increasing flux induces an e.m.f. in the circuit that opposes the change in magnetic flux. By Lenz’s law, the induced electric field in the loop must therefore be opposite to the direction of the current. As the magnitude of the current increases, the rate of the increase lessens and hence the induced e.m.f. decreases. As long as the series resistance can be neglected, this opposing e.m.f. results in a linear increase in current, I=(V/L)*t. The increase in current finally stops when it becomes limited by the series resistance of the inductor. At that moment the amount of magnetic energy stored in the inductor amounts to E=0.5*L*I*I.

Figure 1
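To put numbers on this, here is a small sketch comparing the linear ramp I=(V/L)*t with the standard exponential solution of a series RL circuit, I(t) = (V/R)*(1 - exp(-t*R/L)). The exponential formula is textbook circuit theory rather than something taken from this page, and the component values (12 V, 100 µH, 1 Ω) are arbitrary assumptions chosen only for illustration.

```python
import math

def inductor_current(t, v, l, r):
    """Current through a series RL circuit, t seconds after the battery is connected."""
    return (v / r) * (1.0 - math.exp(-t * r / l))

def linear_ramp(t, v, l):
    """Approximation I = (V/L)*t, valid while t is much smaller than L/R."""
    return (v / l) * t

def stored_energy(l, i):
    """Magnetic energy E = 0.5*L*I*I stored in the inductor."""
    return 0.5 * l * i * i

# Assumed example values: 12 V supply, 100 uH inductor, 1 ohm winding resistance.
V, L, R = 12.0, 100e-6, 1.0
for t in (1e-6, 10e-6, 100e-6, 1e-3):
    print(t, inductor_current(t, V, L, R), linear_ramp(t, V, L))
```

For times well below L/R (100 µs with these values) the two columns nearly coincide; later the exact current flattens out towards V/R while the linear formula keeps growing.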

http://www.crisvandevel.de/

## Cris’ site on antique mechanical four-function calculators

http://public.beuth-hochschule.de/~hamann/

 Prof. Dr.-Ing. Christian-M. Hamann (responsible for the content of this home page). Areas of specialization (former head of the Artificial Intelligence Laboratory): LISP & PROLOG, applied expert systems, evolution strategies & genetic algorithms, simulation of neural networks, positional logical algebra, fuzzy logic. OVER 777 HISTORICAL/TECHNICAL OBJECTS TO EXPLORE … THE ON-LINE MUSEUM: calculating machines, slide rules, … , typewriters, telephones, clocks

https://www.computerhistory.org/babbage/howitworks/

# How it Works


http://w-hasselo.nl/mechn/

# Wim Hasselo

## Early Canon Desktop Calculators

### Contents

Canon “Canola” Model 167P, 1971

http://www.azillionmonkeys.com/qed/sqroot.html

# √Square Roots

## 1. Compute a square root now!

Enter a non-negative number in the Input field below:

http://www.vcalc.net/hp-code.htm

# Microcode: Electronic Building Blocks For Calculators

Last Update: January 22, 2003 — THE HP REFERENCE

Hewlett Packard Personal Calculator Digest
Vol. 3, 1977
pp 4-6

Just as DNA can be called the building blocks of the human organism, microcode can be called the building blocks of the electronic calculator.

But, while the way the human organism works is determined by heredity, the way an electronic calculator works is determined by the highly personal–and even idiosyncratic–creative impulses of a programming specialist.

The principles of programming can be learned, of course. But anyone who has programmed his own calculator quickly discovers that techniques may vary widely from person to person.

Consider the challenge faced by the professional programmer. When you press the key labeled SIN, for example, you expect the calculator to display the sine of the value you have keyed into it–and presto, it does. But in that less-than-a-second interval between keystroke and display, the calculator has executed an internal program of about 3500 steps. And it does this according to the highly individualistic microcode that the programmer has created.

The development of microcode in Hewlett-Packard personal calculators began with the development of the microprocessor in the HP-35–and not coincidentally, since both were developed by the same Hewlett-Packard engineer.

It is the microprocessor that determines the “language” of the internal microcode. If you are familiar with computer languages such as BASIC, FORTRAN, and COBOL, you know that these languages structure the way you write your program on the computer. You can only do what the language lets you do.

The microprocessor is similar to the computer. It provides a language that a clever engineer can then build into a function on the keyboard.

The original HP-35 microprocessor has remained essentially unchanged through the years and is the heart of the new HP-19C and HP-29C. Compared with computer processors, the binary-coded decimal microprocessor is very simple: it does not handle byte data well, but is, in fact, specially designed for 10-digit floating point numbers (See figure 1.). The resulting microcode language most closely resembles machine language, which is programming at its most basic level.

Most microprocessors use 8-bit instructions and two or three instructions are usually combined to perform one operation. The beauty of HP’s calculator microcode is that 10-bit instructions are used and each usually performs a complete operation by itself.

The language’s strongest point is its robust arithmetic section of 37 instructions combined with eight field-select options. The field-select options allow the program to apply the instruction to any word-select portion of the register (See figure 2.).

The language is also designed to use very little storage; only seven registers were used in the HP-35 of which five were user registers. This is done to reduce costs and to save valuable space. For the design engineer it means that he must accomplish all of his miracles within the program itself.

Based on warranty card analysis and other market research, parameters regarding the desired function set and price are given to the design engineer. It is his job then to determine the specific functions for the calculator and to attempt to fit them in the allotted memory. Price is an important factor to the engineer because it directly influences the amount of memory he has to work with.

Only after several months of hard work writing and compacting microcode will he know if the function set will fit. If it is not possible, the product may be redefined at a higher price with greater performance to increase the available memory. More likely however, the engineer will be forced to pare functions until his program fits.

To give you an idea of how much memory is required, the HP-35 used three pages of 256 instructions each. Each page required a separate ROM (read-only memory).

• HP-35: 768 instructions, 3/4 quad (3 pages of 256 instructions each)

The HP-45 originally took six pages of instructions. But about that time the quad ROM was developed, which, as its name implies, was the equivalent of four conventional ROMs. So for the HP-45, two quad ROMs were used. It was in the leftover two pages that an enterprising designer placed the celebrated HP-45 clock. Later calculators, listed below, continued to use quad ROMs.

Note: 1 quad = 1024 instructions or 4 pages of 256 instructions each. -Rick-

Writing the microcode is where the designer’s personality is stamped indelibly on the calculator. While it is true that the fundamental algorithms for computing the complex mathematical functions found in HP personal calculators have remained essentially the same since the HP-35, the individual code is substantially different.

```
+---------------------------------------------------------------------+
| NUMBER   REGISTER REPRESENTATION                                    |
|                                                                     |
|          +---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|   23.4   | 0 | 2 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |  |
|          +---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|                                                                     |
|            a                                                        |
|          +---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|  -123.   | 9 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |  |
|          +---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|                                                                     |
|                                                                   b |
|          +---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|  .002    | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 9 | 7 |  |
|          +---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|                                                                     |
|          a. A "9" in the sign position indicates a negative number. |
|          b. Exponents are kept in 10's complement form.             |
|                                                                     |
+---------------------------------------------------------------------+
|  Figure 1. All numbers in registers are in scientific notation with |
|  the mantissa portion of the number left justified in the mantissa  |
|  portion of the register.                                           |
+---------------------------------------------------------------------+
```
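The layout in Figure 1 can be reproduced with a few lines of Python. This is only a sketch of the number format as drawn in the figure (one sign digit, ten mantissa digits, three exponent digits in 10's complement); the function name `encode_register` is hypothetical, and it does not model rounding or how the calculator itself normalizes values.

```python
from decimal import Decimal

def encode_register(text):
    """Return the 14 digits of a register holding the decimal number `text`.

    Order matches Figure 1: mantissa sign, ten mantissa digits, then the
    exponent sign digit and two exponent digits (10's complement).
    """
    sign, digits, exp = Decimal(text).as_tuple()
    if not any(digits):                        # zero: every digit is 0
        return [0] * 14
    sci_exp = len(digits) - 1 + exp            # exponent in scientific notation
    if not -99 <= sci_exp <= 99:
        raise ValueError("exponent does not fit in two digits")
    mantissa = (list(digits) + [0] * 10)[:10]  # left-justified, zero-filled
    exponent = [int(ch) for ch in "%03d" % (sci_exp % 1000)]  # -3 becomes 997
    return [9 if sign else 0] + mantissa + exponent

for example in ("23.4", "-123", "0.002"):
    print(example, encode_register(example))
```

The three examples from the figure come out as 0 2340000000 001, 9 1230000000 002 and 0 2000000000 997, matching the rows drawn above.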
```
+---------------------------------------------------------------------+
|          mantissa sign                               exponent sign  |
| pointer    |                                           |            |
| positions+---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|    ----> |13 |12 |11 |10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |  |
|          +---+---+---+---+---+---+---+---+---+---+---+---+---+---+  |
|          |   |                                       |           |  |
|          |   |                                       |           |  |
|          |   +------------- mantissa ----------------+- exponent +  |
|          |                                           |           |  |
|          +----------- mantissa and sign -------------+           |  |
|          |                                                       |  |
|          +----------------------  word --------------------------+  |
|                                                                     |
+---------------------------------------------------------------------+

+---------------------------------------------------------------------+
|Instructions                                                         |
+--------------+------------------------------------------------------+
|  A = 0       |                                                      |
|  B = 0       | Clears word-select portion.                          |
|  C = 0       |                                                      |
+--------------+------------------------------------------------------+
|  B = A       | Copies word-select portion                           |
|  C = B       |   from specified register                            |
|  A = C       |   to specified register.                             |
+--------------+------------------------------------------------------+
|  AB EX       | Exchanges word-select portion                        |
|  AC EX       |   between specified registers.                       |
|  CB EX       |                                                      |
+--------------+------------------------------------------------------+
|  A = A + B   |                                                      |
|  A = A + C   |                                                      |
|  C = A + C   |                                                      |
|  C = C + C   |                                                      |
|  A = A - B   |                                                      |
|  C = A - C   |                                                      |
|  A = A - C   | Performs stated arithmetic                           |
|  A = A + 1   |   on word-select portion.                            |
|  C = C + 1   |                                                      |
|  A = A - 1   |                                                      |
|  C = C - 1   |                                                      |
|  C = -C      |                                                      |
|  C = -C - 1  |                                                      |
+--------------+------------------------------------------------------+
|  A SR        |                                                      |
|  B SR        | Shifts word-select portion right.                    |
|  C SR        |                                                      |
+--------------+------------------------------------------------------+
|              |                                                      |
|  A SL        | Shifts word-select portion left.                     |
|              |                                                      |
+--------------+------------------------------------------------------+
|              | Circular shifts whole A                              |
|  A SLC       |   register but does not have                         |
|              |   word-select option.                                |
+--------------+------------------------------------------------------+
|  ? B = 0     |                                                      |
|  ? C = 0     |                                                      |
|  ? A => C    | Tests word-select portion of                         |
|  ? A => B    |   given register.                                    |
|  ? A = 0     |                                                      |
|  ? C = 0     |                                                      |
+--------------+------------------------------------------------------+
| Field-Select Options                                                |
|                                                                     |
|  1. Mantissa (M)                                                    |
|  2. Mantissa and Sign (MS)                                          |
|  3. Exponent (X)                                                    |
|  4. Exponent Sign (XS)                                              |
|  5. Sign of Mantissa (S)                                            |
|  6. Pointer (P)                                                     |
|  7. Word (W)                                                        |
|  8. Word thru Pointer (WP)                                          |
|                                                                     |
+---------------------------------------------------------------------+
|  Figure 2. Registers are 14 digits long with each digit being four  |
|  bits. An additional four bit register is used as a pointer. The    |
|  programmer can set the pointer to any digit position, change that  |
|  digit or all digits up to that position.                           |
+---------------------------------------------------------------------+
```
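As a rough illustration of the field-select idea, the sketch below maps each option in Figure 2 to the digit positions it covers and applies the `A = 0` instruction to that portion only. The digit ranges are read off the figure (position 13 is the mantissa sign, 12 to 3 the mantissa, 2 to 0 the exponent with its sign at 2). Treating "Word thru Pointer" as positions 0 up to the pointer is an assumption based on the caption, and the function names are hypothetical.

```python
def field_positions(field, pointer=0):
    """Digit positions (0 = rightmost) covered by a field-select option."""
    fields = {
        "M": range(3, 13),                 # mantissa
        "MS": range(3, 14),                # mantissa and sign
        "X": range(0, 3),                  # exponent, including its sign digit
        "XS": range(2, 3),                 # exponent sign
        "S": range(13, 14),                # sign of mantissa
        "P": range(pointer, pointer + 1),  # the single digit at the pointer
        "W": range(0, 14),                 # whole word
        "WP": range(0, pointer + 1),       # assumed: digit 0 up to the pointer
    }
    return list(fields[field])

def clear_field(register, field, pointer=0):
    """Apply 'A = 0' to one field of a 14-character register string.

    The string is written as in the figures, so its leftmost character is
    digit position 13.
    """
    digits = list(register)
    for position in field_positions(field, pointer):
        digits[13 - position] = "0"
    return "".join(digits)

print(clear_field("91230000000002", "S"))  # clears only the sign digit
print(clear_field("91230000000002", "X"))  # clears only the exponent
```

One instruction combined with eight such options behaves like eight different instructions, which helps explain how a small instruction set can cover so much ground.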

Some of the major routines such as those for sine, cosine, and tangent are the same from calculator to calculator. But in most cases if the code cannot be borrowed exactly as it exists in another calculator, it must be rewritten.

Some designers begin by strictly flowcharting the entire program. Others tackle the code directly and leave all but basic flowcharting till the end for documentation purposes only. The code is written originally on paper just as you would write a program on an HP programming pad. It is then either punched on cards or typed into a handy CRT.

The major task in writing the microcode is not only to have the functions produce the correct answers when a key is pressed, but to fit the code into a given amount of memory and make it execute as fast as possible. Compacting the code, that is, rewriting sections of the program to make them more space-efficient, is easy enough, but sometimes results in a loss of speed. These are tradeoffs to which the designer must be constantly attentive.

The processor executes approximately three instructions per millisecond, with each instruction taking the same amount of time. The designer takes into consideration the type of function and the necessary speed when writing the code. Straight-line code with no branches of any kind executes the fastest. For the label search function on the HP-67 and the HP-97 the designer duplicated a great deal of code to make it faster. The print instructions for the new HP-19C, on the other hand, did not need to be as fast to keep up with the printer mechanism. So the designer compacted these codes, making them very complex.

```
+---------------------------------------------------------------------+
|     +---------------+    +---------------+    +---------------+     |
|     |   A. Answer   |    |B. Multiplicand|    | C. Multiplier |     |
|     +---------------+    +---------------+    +---------------+     |
|                                                                     |
+--------+-----------+--------+---------------------------------------+
| LABEL  | CODE      |        | Explanation                           |
+--------+-----------+--------+---------------------------------------+
|        | A = 0     | W      | Clearing space for answer.            |
|        | P =       | 3      | Starting at least significant digit.  |
|        | GoTo      | Mpy100 | Starting in middle of loop for speed. |
| Mpy90  | A = A + B | W      | Add Multiplicand to partial product   |
|        |           |        |   the number of times specified by    |
|        |           |        |   digit being worked on.              |
+--------+-----------+--------+---------------------------------------+
| Mpy100 | C = C - 1 | P      |                                       |
|        | GoNC      | Mpy90  | NC stands for no carry, i.e. a GoTo   |
|        |           |        |   is executed unless a digit is       |
|        |           |        |   carried (or borrowed).              |
|        | ? P =     | 12     | Have we reached the end of the word?  |
|        | GoYes     | END    |                                       |
|        |           |        |                                       |
+--------+-----------+--------+---------------------------------------+
|        | P = P + 1 |        | Move on the next digit.               |
|        | A SR      | W      | This does a divide by 10 to line up   |
|        |           |        |   for the next decade.                |
|        | GoTo      | Mpy100 | Go reenter loop.                      |
|        |           |        |                                       |
+--------+-----------+--------+---------------------------------------+
|  Figure 3. Multiplying two numbers uses the basic routine shown     |
|  above and involves three registers. The starting point of this     |
|  code assumes the sign and exponent of the answer have already      |
|  been calculated. END is a common function return which would       |
|  perform operations such as display formatting, printing, etc.      |
+---------------------------------------------------------------------+
```
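The routine in Figure 3 is a digit-by-digit shift-and-add multiply. The Python sketch below follows the same loop under some simplifying assumptions: mantissas are held as plain 10-digit integers rather than BCD registers, the entry point is taken to be the decrement step (so a digit d causes exactly d additions), and sign, exponent and final normalization are left out, as the figure caption says they are handled elsewhere. The function name is hypothetical.

```python
def multiply_mantissas(multiplicand, multiplier, digits=10):
    """Shift-and-add product of two mantissas held as `digits`-digit integers.

    2.0 is written as 2000000000. The result uses the same scale, but may
    carry into an extra leading digit, e.g. 9.0 * 9.0 gives 81000000000.
    """
    a = 0                                                        # A = 0: answer
    c = [(multiplier // 10 ** p) % 10 for p in range(digits)]    # C, least significant digit first
    p = 0                                                        # pointer at the lowest mantissa digit
    while True:
        # Mpy100 / Mpy90: count the current digit down, adding B each time.
        while c[p] > 0:
            c[p] -= 1
            a += multiplicand
        if p == digits - 1:                                      # '? P = 12': last digit done
            return a
        p += 1                                                   # move on to the next digit
        a //= 10                                                 # shift right: divide by 10

print(multiply_mantissas(1200000000, 1200000000))  # 1.2 * 1.2
```

As in the original routine, each right shift discards the lowest digit of the partial product, so the result is truncated to the register width rather than rounded.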

In general, programmable calculators go through more gyrations for each function than do preprogrammed calculators because they must generate an intermediate keycode. Where simple functions such as Change Sign, x exchange y, and ENTER require only 100 steps of code on a preprogrammed calculator, the same functions on a programmable calculator might take 150 steps. And a complex function such as rectangular-to-polar conversion might take over 4000 steps, depending on the argument keyed in.

The engineer uses a computer simulator to write and debug his code. Special programs written for this simulator furnish him with status bit information, register contents, and intermediate answers as an aid in this process.

Once the microcode is completed on the computer, it is transferred to an E-ROM (erasable read-only memory) simulator for further debugging (See photo.). The simulator is ideal for this because it is portable and easily updated as bugs are found and corrected. Simulated calculators are given to application engineers and quality assurance engineers to help locate problems.

After much editing, the microcode is ready to be converted into hardware. This usually takes several weeks. In the meantime, the simulators continue to be used heavily and, in most cases, additional bugs are found.

When the completed integrated circuit chips return, the first working models of the calculator are constructed. Final testing is then initiated. Some problems can only be discovered at this stage because of the peculiarities of simulated operation. For example, this is the first time low battery indicators can be checked since the E-ROM simulator does not work on batteries.

At long last, the revised code is ready to be sent for final chips. Although no problems are anticipated at this stage, testing continues to assure traditional Hewlett-Packard reliability.

The tremendous emphasis on testing of the calculators is for practical reasons as well: incorrect programs cannot be corrected as easily as they can be on large computers. Once the code is set in hardware, changes are costly and inconvenient.

When the final chips are approved for production, the development cycle is complete. The design engineer has spent anywhere from six months to 13 months perfecting his building block design. And whether the end product is a high-powered financial calculator like the HP-92 or a versatile keystroke programmable like the HP-19C, it is first an expression of his personality and creativity.

The Calculator Reference by Rick Furr (rfurr@vcalc.net)


Division, part 1: Integer division

# DIVISION, PART 1: INTEGER DIVISION

Been a long time since I said anything here, and I and my Nippon HL-21 are parted for a few weeks, so I’ll talk algorithms for a while. Thus far we’ve seen addition, subtraction, and multiplication. The missing operation is division. Now, we’re going to start with “fourth-grade” division: the sort where you divide one whole number by another whole number and get a remainder. Later we’ll talk about decimals.

To start with, we might look at what integer division means, and how we do it by hand; the latter will inform our design of algorithms. Let’s suppose we want to divide, say, 52301 by 13.
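For reference, here is a minimal sketch of the schoolbook method applied to that example. It is my own illustration of hand long division, not the algorithm the article goes on to develop, and the function name is made up.

```python
def long_division(dividend, divisor):
    """Divide two non-negative integers the way it is done by hand.

    Digits of the dividend are brought down one at a time. At each step we
    find how many times the divisor fits into the running remainder.
    """
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient, remainder = 0, 0
    for ch in str(dividend):
        remainder = remainder * 10 + int(ch)   # bring down the next digit
        digit = remainder // divisor           # how many times does it fit?
        remainder -= digit * divisor
        quotient = quotient * 10 + digit       # append the new quotient digit
    return quotient, remainder

print(long_division(52301, 13))  # (4023, 2)
```

Checking the result: 13 * 4023 = 52299, and 52299 + 2 = 52301.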

http://www.pythondiary.com/blog/Oct.15,2014/building-cpu-simulator-python.html

## Wednesday, October 15th, 2014

### Building a CPU simulator in Python

You can find the entire source listing from this article, plus more interesting code on my Python Experiments BitBucket page. Plus, the source code there will evolve and grow.

Early this morning in my Planet Python feed, I read this really interesting article called Developing Upwards: CARDIAC: The Cardboard Computer, which is about this cardboard computer called the Cardiac. As some of my followers and readers might know, I have another project that I have been working on and evolving for the past months called simple-cpu, which I have released the source code for. I really should give that project a proper license so that others might feel interested in using it in their own projects… Anyways, hopefully I’ll have that done by the time this publishes.

```
class Cardiac(object):
    """
    This class is the cardiac "CPU".
    """
    def __init__(self):
        self.init_cpu()
        self.reset()
        self.init_mem()
        self.init_reader()  # Call assumed; it appears to have been lost along with the def line below.
        self.init_output()
    def reset(self):
        """
        This method resets the CPU's registers to their defaults.
        """
        self.pc = 0 #: Program Counter
        self.ir = 0 #: Instruction Register
        self.acc = 0 #: Accumulator
        self.running = False #: Are we running?
    def init_cpu(self):
        """
        This fancy method will automatically build a list of our opcodes into a hash.
        This enables us to build a typical case/select system in Python and also keeps
        things more DRY.  We could have also used the getattr during the process()
        method before, and wrapped it around a try/except block, but that looks
        a bit messy.  This keeps things clean and simple with a nice one-to-one
        call-map.
        """
        self.__opcodes = {}
        classes = [self.__class__] #: This holds all the classes and base classes.
        while classes:
            cls = classes.pop() # Pop the classes stack and begin
            if cls.__bases__: # Does this class have any base classes?
                classes = classes + list(cls.__bases__)
            for name in dir(cls): # Lets iterate through the names.
                if name[:7] == 'opcode_': # We only want opcodes here.
                    try:
                        opcode = int(name[7:])
                    except ValueError:
                        raise NameError('Opcodes must be numeric, invalid opcode: %s' % name[7:])
                    self.__opcodes.update({opcode:getattr(self, 'opcode_%s' % opcode)})
    def init_mem(self):
        """
        This method resets the Cardiac's memory space to all blank strings, as per Cardiac specs.
        """
        self.mem = ['' for i in range(0,100)]
        self.mem[0] = '001' #: The Cardiac bootstrap operation.
    def init_reader(self):  # The def line is missing from this copy; the name is inferred from the docstring.
        """
        This method initializes the input reader.
        """
        self.reader = [] #: This variable can be accessed after initializing the class to provide input data.
    def init_output(self):
        """
        This method initializes the output deck/paper/printer/teletype/etc...
        """
        self.output = []
```

Hopefully I left enough comments in the code above for you to understand exactly what’s going on here. A key difference from my simple-cpu project is the way opcodes are handled. I actually plan on incorporating this new way of detecting opcodes into that project in the near future, since it makes it easier for other developers of the library to extend it for their own requirements. As mentioned before, that project has been evolving as I started to understand more about how things should be done. In fact, taking on a CPU simulator project is a really incredible learning experience all on its own. If you are truly a computer scientist, you should understand the inner workings of a CPU, and how each opcode is processed at its lowest level. Plus, developing and seeing a custom-built CPU simulator made by your own imagination is a very gratifying experience. It’s almost like giving birth, as this machine was entirely built by your mind alone, and seeing it work is something magical.
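To make the opcode-detection idea concrete, here is a minimal sketch of the same naming trick in isolation. The class names `TinyCPU` and `ExtendedCPU` and their three opcodes are hypothetical and exist only for this illustration; the point is that a subclass gains a new opcode simply by defining a method with the right name.

```python
class TinyCPU:
    """Hypothetical minimal class; only the dispatch-table idea is shown."""

    def __init__(self):
        self.acc = 0
        # Map each numeric suffix to its bound method, e.g. 1 -> self.opcode_1.
        self.opcodes = {}
        for name in dir(self):
            if name.startswith("opcode_"):
                self.opcodes[int(name[len("opcode_"):])] = getattr(self, name)

    def opcode_1(self, data):
        """Load: set the accumulator."""
        self.acc = data

    def opcode_2(self, data):
        """Add to the accumulator."""
        self.acc += data


class ExtendedCPU(TinyCPU):
    def opcode_7(self, data):
        """Subtract; picked up automatically because of its name."""
        self.acc -= data


cpu = ExtendedCPU()
cpu.opcodes[1](5)   # load 5
cpu.opcodes[7](2)   # subtract 2
print(cpu.acc)      # prints 3
```

Because `dir()` already includes inherited attributes, this sketch does not need to walk the base classes by hand the way `init_cpu()` does.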

In the next part of this class, we will focus on the utility functions that we may need to call and use multiple times. These methods can also be overridden in subclasses to alter how the CPU actually functions:

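```
    def read_deck(self, fname):
        """
        This method reads a deck of cards from a file into the reader.
        """
        # The body of this method is missing from the source; this is an
        # assumed reconstruction that expects one card per line.
        with open(fname) as deck:
            self.reader = [line.strip() for line in deck]
        self.reader.reverse() # So that pop() returns the cards in deck order.

    def fetch(self):
        """
        This method retrieves an instruction from memory address pointed to by the program pointer.
        Then we increment the program pointer.
        """
        self.ir = int(self.mem[self.pc])
        self.pc +=1

    def get_memint(self, data):
        """
        Since our memory storage is *string* based, like the real Cardiac, we need
        a reusable function to grab a integer from memory.  This method could be
        overridden if say a new memory type was implemented, say an mmap one.
        """
        return int(self.mem[data])

    def pad(self, data, length=3):
        """
        This function pads either an integer or a number in string format with
        zeros.  This is needed to replicate the exact behavior of the Cardiac.
        """
        # The signature and the padding line are missing from the source;
        # the default length of 3 is an assumption based on 3-digit memory cells.
        orig = int(data)
        padding = '0' * length
        data = '%s%s' % (padding, abs(orig))
        if orig < 0:
            return '-'+data[-length:]
        return data[-length:]
```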

These are the various utility functions, all of which might be overridden in a subclass. Later in this article, I will also provide an alternate source output which displays how this simulator can be built using Python Mixin classes, to make things even more pluggable. Finally, we find ourselves at the final part of code to get this simulator running, the actual processing methods:

```
    def process(self):
        """
        Process a single opcode from the current program counter.  This is
        normally called from the running loop, but can also be called
        manually to provide a "step-by-step" debugging interface, or
        to slow down execution using time.sleep().  This is the method
        that will also need to used if you build a TK/GTK/Qt/curses frontend
        to control execution in another thread of operation.
        """
        self.fetch()
        opcode, data = int(math.floor(self.ir / 100)), self.ir % 100
        self.__opcodes[opcode](data)

    def opcode_0(self, data):
        """ INPUT Operation """
        # Body missing from the source; assumed to take the next card from the reader.
        self.mem[data] = self.reader.pop()

    def opcode_1(self, data):
        """ Clear and Add Operation """
        self.acc = self.get_memint(data)

    def opcode_2(self, data):
        """ Add Operation """
        self.acc += self.get_memint(data)

    def opcode_3(self, data):
        """ Test Accumulator contents Operation """
        if self.acc < 0:
            self.pc = data

    def opcode_4(self, data):
        """ Shift operation """
        x,y = int(math.floor(data / 10)), int(data % 10)
        for i in range(0,x):
            self.acc = (self.acc * 10) % 10000
        for i in range(0,y):
            self.acc = int(math.floor(self.acc / 10))

    def opcode_5(self, data):
        """ Output operation """
        self.output.append(self.mem[data])

    def opcode_6(self, data):
        """ Store operation """
        # Body missing from the source; assumed to store the padded accumulator.
        self.mem[data] = self.pad(self.acc)

    def opcode_7(self, data):
        """ Subtract Operation """
        self.acc -= self.get_memint(data)

    def opcode_8(self, data):
        """ Unconditional Jump operation """
        self.pc = data

    def opcode_9(self, data):
        """ Halt and Reset operation """
        self.reset()

    def run(self, pc=None):
        """ Runs code in memory until halt/reset opcode. """
        if pc:
            self.pc = pc
        self.running = True
        while self.running:
            self.process()
        print("Output:\n%s" % '\n'.join(self.output))
        self.init_output()

if __name__ == '__main__':
    c = Cardiac()
    c.read_deck('deck1.txt') # Assumed: the deck-loading call is missing from the source.
    try:
        c.run()
    except:
        print("IR: %s\nPC: %s\nOutput: %s\n" % (c.ir, c.pc, '\n'.join(c.output)))
        raise
```

As mentioned above, I will now refactor the code into Mixin classes and provide a full source output of that:

```
import math


class Memory(object):
    """
    This class controls the virtual memory space of the simulator.
    """
    def init_mem(self):
        """
        This method resets the Cardiac's memory space to all blank strings, as per Cardiac specs.
        """
        self.mem = ['' for i in range(0,100)]
        self.mem[0] = '001' #: The Cardiac bootstrap operation.

    def get_memint(self, data):
        """
        Since our memory storage is *string* based, like the real Cardiac, we need
        a reusable function to grab a integer from memory.  This method could be
        overridden if say a new memory type was implemented, say an mmap one.
        """
        return int(self.mem[data])

    def pad(self, data, length=3):
        """
        This function pads either an integer or a number in string format with
        zeros.  This is needed to replicate the exact behavior of the Cardiac.
        """
        # The signature and the padding line are missing from the source;
        # the default length of 3 is an assumption based on 3-digit memory cells.
        orig = int(data)
        padding = '0' * length
        data = '%s%s' % (padding, abs(orig))
        if orig < 0:
            return '-'+data[-length:]
        return data[-length:]

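class IO(object):
    """
    This class controls the virtual I/O of the simulator.
    To enable alternate methods of input and output, swap this.
    """
    def init_reader(self):
        """
        This method initializes the input reader.
        """
        self.reader = [] #: This variable can be accessed after initializing the class to provide input data.

    def init_output(self):
        """
        This method initializes the output deck/paper/printer/teletype/etc...
        """
        self.output = []

    def read_deck(self, fname):
        """
        This method reads a deck of cards from a file into the reader.
        """
        # The signature and body are missing from the source; this is an
        # assumed reconstruction that expects one card per line.
        with open(fname) as deck:
            self.reader = [line.strip() for line in deck]
        self.reader.reverse() # So that pop() returns the cards in deck order.

    def format_output(self):
        """
        This method is to format the output of this virtual IO device.
        """
        return '\n'.join(self.output)

    def get_input(self):
        """
        This method is used to get input from this IO device, this could say
        be replaced with raw_input() to manually enter in data.
        """
        try:
            # Line missing from the source; assumed to take the next card.
            return self.reader.pop()
        except IndexError:
            # Fall back to raw_input() in the case of EOF on the reader.
            # On Python 3, use input() instead of raw_input().
            return raw_input('INP: ')[:3]

    def stdout(self, data):
        self.output.append(data)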

class CPU(object):
    """
    This class is the cardiac "CPU".
    """
    def __init__(self):
        self.init_cpu()
        self.reset()
        try:
            self.init_mem()
        except AttributeError:
            raise NotImplementedError('You need to Mixin a memory-enabled class.')
        try:
            self.init_reader() # Assumed: this call is missing from the source.
            self.init_output()
        except AttributeError:
            raise NotImplementedError('You need to Mixin a IO-enabled class.')

    def reset(self):
        """
        This method resets the CPU's registers to their defaults.
        """
        self.pc = 0 #: Program Counter
        self.ir = 0 #: Instruction Register
        self.acc = 0 #: Accumulator
        self.running = False #: Are we running?

    def init_cpu(self):
        """
        This fancy method will automatically build a list of our opcodes into a hash.
        This enables us to build a typical case/select system in Python and also keeps
        things more DRY.  We could have also used the getattr during the process()
        method before, and wrapped it around a try/except block, but that looks
        a bit messy.  This keeps things clean and simple with a nice one-to-one
        call-map.
        """
        self.__opcodes = {}
        classes = [self.__class__] #: This holds all the classes and base classes.
        while classes:
            cls = classes.pop() # Pop the classes stack and begin
            if cls.__bases__: # Does this class have any base classes?
                classes = classes + list(cls.__bases__)
            for name in dir(cls): # Lets iterate through the names.
                if name[:7] == 'opcode_': # We only want opcodes here.
                    try:
                        opcode = int(name[7:])
                    except ValueError:
                        raise NameError('Opcodes must be numeric, invalid opcode: %s' % name[7:])
                    self.__opcodes.update({opcode:getattr(self, 'opcode_%s' % opcode)})

    def fetch(self):
        """
        This method retrieves an instruction from memory address pointed to by the program pointer.
        Then we increment the program pointer.
        """
        self.ir = self.get_memint(self.pc)
        self.pc +=1

    def process(self):
        """
        Process a single opcode from the current program counter.  This is
        normally called from the running loop, but can also be called
        manually to provide a "step-by-step" debugging interface, or
        to slow down execution using time.sleep().  This is the method
        that will also need to used if you build a TK/GTK/Qt/curses frontend
        to control execution in another thread of operation.
        """
        self.fetch()
        opcode, data = int(math.floor(self.ir / 100)), self.ir % 100
        self.__opcodes[opcode](data)

    def opcode_0(self, data):
        """ INPUT Operation """
        self.mem[data] = self.get_input()

    def opcode_1(self, data):
        """ Clear and Add Operation """
        self.acc = self.get_memint(data)

    def opcode_2(self, data):
        """ Add Operation """
        self.acc += self.get_memint(data)

    def opcode_3(self, data):
        """ Test Accumulator contents Operation """
        if self.acc < 0:
            self.pc = data

    def opcode_4(self, data):
        """ Shift operation """
        x,y = int(math.floor(data / 10)), int(data % 10)
        for i in range(0,x):
            self.acc = (self.acc * 10) % 10000
        for i in range(0,y):
            self.acc = int(math.floor(self.acc / 10))

    def opcode_5(self, data):
        """ Output operation """
        self.stdout(self.mem[data])

    def opcode_6(self, data):
        """ Store operation """
        # Body missing from the source; assumed to store the padded accumulator.
        self.mem[data] = self.pad(self.acc)

    def opcode_7(self, data):
        """ Subtract Operation """
        self.acc -= self.get_memint(data)

    def opcode_8(self, data):
        """ Unconditional Jump operation """
        self.pc = data

    def opcode_9(self, data):
        """ Halt and Reset operation """
        self.reset()

    def run(self, pc=None):
        """ Runs code in memory until halt/reset opcode. """
        if pc:
            self.pc = pc
        self.running = True
        while self.running:
            self.process()
        print("Output:\n%s" % self.format_output())
        self.init_output()

class Cardiac(CPU, Memory, IO):
    pass

if __name__ == '__main__':
    c = Cardiac()
    c.read_deck('deck1.txt') # Assumed: the deck-loading call is missing from the source.
    try:
        c.run()
    except:
        print("IR: %s\nPC: %s\nOutput: %s\n" % (c.ir, c.pc, c.format_output()))
        raise
```

You can find the code for deck1.txt in the Developing Upwards: CARDIAC: The Cardboard Computer article; I used the counting-to-10 example.

Hopefully this article was inspiring to you and gave you a brief understanding of how to build class-based, modular, and pluggable code in Python, as well as a nice introduction to developing a CPU simulator. The next article, which I hope to publish soon, will guide you through making a basic assembler for this CPU, so that you can easily and effortlessly create decks to play around with in the simulator.

Comment #1: Posted 4 years, 6 months ago by Calvin Spealman

This is such cool work! The simplicity of the CARDIAC is certainly something to admire. If you don’t mind I’ll probably follow up with a nod towards this. Are you planning to put the code up at a proper repo anywhere?

Comment #2: Posted 4 years, 6 months ago by Kevin Veroneau

@Calvin, the code is up in a repo, look inside the information tooltip box at the very top of this article for a click-able link to the BitBucket page.


Python Powered | © 2012-2019 Kevin Veroneau

https://www.cs.drexel.edu/~bls96/museum/cardiac.html

## Background

The acronym CARDIAC stands for “CARDboard Illustrative Aid to Computation.” It was developed by David Hagelbarger at Bell Labs as a tool for teaching how computers work in a time when access to real computers was extremely limited. The CARDIAC kit consists of a folded cardboard “computer” and an instruction manual. In July 1969, the Bell Laboratories Record contained an article describing the system and the materials being made available to teachers for working with it.

As illustrated in the following pictures, the CARDIAC computer consisted of a left-hand CPU section and a right-hand memory section. On the CPU side there are five sliders:

• One slider of input “cards”
• One slider for the accumulator sign
• Three sliders for the digits of an instruction

The memory side has a single slider of output “cards.” Portions of the sliders show through cutouts in the card frame. The cutouts for the input and output card sliders each show the current card to be read or written. The combination of the accumulator sign and the three instruction sliders show steps through cutouts that describe the operation of the selected instruction. Effectively, the sliders and cutouts are the instruction decoder of the CPU. Finally, each memory location has a hole in it. A small cardboard ladybug serves as the program counter which is moved from location to location in response to the steps described on the CPU side.

The CARDIAC manual is 50+ pages divided into 16 sections describing the basics of computers from a late 1960s perspective. The first six sections cover things like flow charts, instructions, data, addresses, and the stored program concept. Sections 7–12 discuss the CARDIAC and some basic programming techniques including loops and multiplication. Sections 13 and 14 discuss the techniques for bootstrapping and subroutines, both of which we elaborate on below. Section 15 focuses on the development of a program to play NIM. Finally, Section 16 discusses assemblers and compilers. Although there is some duplication of information, the material on this page is not intended to replace the manual. Rather, the material here expands on that in the manual, particularly from the point of view of one who is already familiar with the most basic concepts of computers and programming.


## CARDIAC Architecture

### Memory

The CARDIAC has a grand total of 100 memory locations identified by the two-digit decimal numbers 00 through 99. Each memory location holds a signed three-digit decimal number. (With the exception of a single code example, the CARDIAC book is actually silent on whether memory contains signed or unsigned values.) Locations 00 and 99 are special. Location 00 always contains the value 001, which as we see below is the instruction to read a card into location 01. This special value is used in the bootstrapping process discussed later. Location 99 always contains a value between 800 and 899. The tens and ones digits of the number are the value of the program counter after a jump instruction is executed. This provides the mechanism for a return from subroutine.

### CPU

The CARDIAC CPU is a single-accumulator single-address machine. Thus each instruction operates optionally on a single memory location and the accumulator. For example, the ADD instruction reads the data in one memory location, adds it to the current value of the accumulator and stores the result back into the accumulator. The ALU supports addition, subtraction, and decimal shifting. CARDIAC’s CPU architecture is illustrated in the following figure:

The CARDIAC accumulator holds a signed 4-digit number, which seems odd given that everything else is oriented around 3-digit numbers. The manual includes the statement:

Since CARDIAC’s memory can store only 3-digit numbers, you may be puzzled by the inclusion of an extra square in the accumulator. It is there to handle the overflow that will result when two 3-digit numbers whose sum exceeds 999 are added.

What’s not clear is under what conditions that overflow/carry digit is kept or discarded. From the discussion of the SFT instruction in Section 12 of the manual, exactly four digits are kept for the intermediate value between the left and right shift operations. However, the manual doesn’t state whether all four digits are kept between instructions nor what happens when storing the accumulator to memory if the accumulator contains a number whose magnitude is greater than 999. In the case of our simulator, we retain all four digits, effectively implementing a 4-digit ALU. However, when storing the accumulator to memory, we discard the fourth digit. I.e. the number stored in memory is a mod 1000, where a is the contents of the accumulator.
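The store rule described above can be sketched in a few lines. The function name `value_to_store` is hypothetical, and applying the mod 1000 to the magnitude while keeping the sign is an assumption about how negative accumulator values are treated.

```python
def value_to_store(acc):
    """Value written to memory for a 4-digit accumulator.

    Assumption: the fourth digit is dropped from the magnitude and the
    sign is kept, so -1004 is stored as -4.
    """
    sign = -1 if acc < 0 else 1
    return sign * (abs(acc) % 1000)


acc = 999 + 5                 # the sum overflows three digits
print(acc)                    # prints 1004
print(value_to_store(acc))    # prints 4
```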

### I/O

The CARDIAC has exactly one input device and one output device. These are a card reader and a card punch. Unlike real punch cards, the CARDIAC input and output cards can each hold exactly one signed three-digit number. When a card is read by way of the INP instruction, it is taken into the card reader and removed from the stack of cards to be read. Similarly, on each OUT instruction, a new card is “punched” with the specified value on it, and the card moved to the output card stack.

## Instruction Set

The CARDIAC’s instruction set has only 10 instructions, each identified by an operation code (opcode) of 0 through 9. The instructions are as follows:

| Opcode | Mnemonic | Operation |
|---|---|---|
| 0 | INP | Read a card into memory |
| 1 | CLA | Clear accumulator and add from memory (load) |
| 2 | ADD | Add from memory to accumulator |
| 3 | TAC | Test accumulator and jump if negative |
| 4 | SFT | Shift accumulator |
| 5 | OUT | Write memory location to output card |
| 6 | STO | Store accumulator to memory |
| 7 | SUB | Subtract memory from accumulator |
| 8 | JMP | Jump and save PC |
| 9 | HRS | Halt and reset |

### Encoding

All instructions are non-negative numbers expressed as three-digit decimal numerals. The CARDIAC manual doesn’t describe what happens if an attempt is made to execute a negative instruction. In our simulator, we treat negative instructions as no-ops (i.e. they are ignored and the program continues on to the next instruction). The operation code is the most significant of those three digits, i.e., o=⌊i /100⌋, where i is the contents of the instruction register (IR) loaded from the memory location specified by the PC. For most instructions, the lower-order digits are the address of the operand, i.e. a=i mod 100. This arrangement is illustrated in the following figure.

In the cases of the INP and STO instructions, a is the destination address for the data coming from either an input card or the accumulator, respectively. In the cases of the CLA, ADD, OUT, and SUB instructions, a is the source address of the second operand to the ALU or the source address of the operand being written to an output card. For the TAC, JMP, and HRS instructions, a is the address to be loaded into the PC (conditionally, in the case of the TAC instruction). The remaining instruction, SFT, doesn’t treat the lower-order digits as an address. Instead, each of the lower-order digits is a number of digit positions to shift first left, then right. The left shift count is given by l=⌊a /10⌋, and the right shift count is given by r=a mod 10. The instruction format for the SFT instruction is shown in the following figure:
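The encoding rules above translate directly into integer arithmetic. The following is a small sketch with the hypothetical helper names `decode` and `shift_counts`; it assumes a non-negative instruction word.

```python
def decode(instruction):
    """Split a non-negative three-digit instruction into (opcode, address)."""
    opcode = instruction // 100    # o = floor(i / 100)
    address = instruction % 100    # a = i mod 100
    return opcode, address


def shift_counts(address):
    """For SFT, split the two low-order digits into (left, right) counts."""
    left = address // 10           # l = floor(a / 10)
    right = address % 10           # r = a mod 10
    return left, right


print(decode(843))        # prints (8, 43)
print(shift_counts(31))   # prints (3, 1)
```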

### Instruction Execution

The instructions operate as described here. In this discussion, we use the following notation:

| Notation | Meaning |
|---|---|
| ACC | Contents of the accumulator |
| PC | Contents of the program counter |
| a | Operand address as described in the previous subsection |
| MEM[x] | Contents of memory location x |
| INPUT | Contents of one card read from the input |
| OUTPUT | Contents of one card written to the output |

INP
The INP instruction reads a single card from the input and stores the contents of that card into the memory location identified by the operand address. (MEM[a] ← INPUT)
CLA
This instruction causes the contents of the memory location specified by the operand address to be loaded into the accumulator. (ACC ← MEM[a])
ADD
The ADD instruction takes the contents of the accumulator, adds it to the contents of the memory location identified by the operand address and stores the sum into the accumulator. (ACC ← ACC + MEM[a])
TAC
The TAC instruction is the CARDIAC’s only conditional branch instruction. It tests the accumulator, and if the accumulator is negative, then the PC is loaded with the operand address. Otherwise, the PC is not modified and the program continues with the instruction following the TAC. (If ACC < 0, PC ← a)
SFT
This instruction causes the accumulator to be shifted to the left by some number of digits and then back to the right some number of digits. The amounts by which it is shifted are shown above in the encoding for the SFT instruction. (ACC ← (ACC × 10^l) / 10^r)
OUT
The OUT instruction takes the contents of the memory location specified by the operand address and writes them out to an output card. (OUTPUT ← MEM[a])
STO
This is the inverse of the CLA instruction. The accumulator is copied to the memory location given by the operand address. (MEM[a] ← ACC)
SUB
In the SUB instruction the contents of the memory location identified by the operand address is subtracted from the contents of the accumulator and the difference is stored in the accumulator. (ACC ← ACC − MEM[a])
JMP
The JMP instruction first copies the PC into the operand part of the instruction at address 99. So if the CARDIAC is executing a JMP instruction stored in memory location 42, then the value 843 will be stored in location 99. Then the operand address is copied into the PC, causing the next instruction to be executed to be the one at the operand address. (MEM[99] ← 800 + PC; PC ← a)
HRS
The HRS instruction halts the CARDIAC and puts the operand address into the PC. (PC ← a; HALT)
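Two of these instructions have effects that are easy to get wrong, so here is a hedged sketch of SFT and JMP as plain functions. The names `sft` and `jmp` are hypothetical, shifting the magnitude and reapplying the sign is an assumption, and `pc` is taken to have already been incremented past the JMP, as in the location 42 example above.

```python
def sft(acc, left, right):
    """Shift left by `left` digits, keep four digits, then shift right."""
    sign = -1 if acc < 0 else 1
    magnitude = (abs(acc) * 10 ** left) % 10000   # exactly four digits are kept
    return sign * (magnitude // 10 ** right)


def jmp(mem, pc, a):
    """Save the return jump in location 99 and return the new PC."""
    mem[99] = 800 + pc    # MEM[99] <- 800 + PC
    return a              # PC <- a


mem = [0] * 100
new_pc = jmp(mem, 43, 10)     # JMP 10 stored at location 42, so PC is 43
print(mem[99], new_pc)        # prints 843 10
print(sft(1234, 1, 1))        # prints 234
```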

### Assembly Language

All of the code fragments and complete program examples on this page are shown in an assembly language format with each line organized into six columns:

1. Address: The first column shows the memory address represented by that line.
2. Contents: In the second column, we put the number that is stored in that memory location. In most cases, this is an instruction, but for lines with a DATA pseudo-op, it is a data value.
3. Label: The third column contains an optional label on the memory location, allowing it to be identified by name, rather than by address.
4. Opcode: Instruction mnemonics are placed in the fourth column. In addition to the ten instructions discussed above, we will use one pseudo-op (or assembler directive), DATA. For memory locations containing a DATA item, the operand is the literal data value stored in the memory location, rather than an operand for an instruction. This pseudo-op is particularly useful when labeled for creating variables.
5. Operand: The fifth column is the operand part of the instruction or the literal data for a DATA directive. Numerical operands are included directly in the address field of the instruction. When a label name appears as an operand, the memory address associated with that label is placed in the address field of the instruction.
6. Comment: Any desired descriptive text can be placed after the operand.

## Indirection, Indexing, and Pointers

Notice that the only way of specifying the address of a memory location we want to use is in the instruction itself. Most computer architectures provide a mechanism whereby the address we want to use can be stored in a register or another memory location. Variables which contain memory addresses are usually referred to as pointers.

Even though the CARDIAC doesn’t have hardware support for using pointers directly, we can still do simple indirect addressing. Suppose we have a variable stored in a memory location called ptr and it has the value 42 in it. Now if we want to load the accumulator with the contents of memory location 42, we can do something like:

```
05	100	loader	DATA	100
06	042	ptr	DATA	042

20	105		CLA	loader
21	206		ADD	ptr
22	623		STO	indload
23	100	indload	CLA	00
```

Notice that even though we have specified that we will load from location 00 in the instruction at location 23, we will have changed it to load from location 42 by the time we execute that instruction. For that matter, it doesn’t matter if we’ve loaded anything into location 23 before starting this sequence. It will get set before we use it.
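The self-modifying trick is perhaps easier to see when the memory is modeled as a plain list. In this sketch the function name `indirect_load` is hypothetical; the addresses 05, 06, 23, and 42 follow the text, and the value 123 is made up for the demonstration.

```python
def indirect_load(mem, loader_addr, ptr_addr, patch_addr):
    """Return the value the accumulator holds after the patched CLA runs."""
    acc = mem[loader_addr]           # CLA loader: acc = 100, i.e. "CLA 00"
    acc += mem[ptr_addr]             # add the pointer: acc = 100 + 42 = 142
    mem[patch_addr] = acc            # overwrite the instruction at patch_addr
    opcode, address = divmod(mem[patch_addr], 100)
    assert opcode == 1               # still a CLA, now aimed at address 42
    return mem[address]


mem = [0] * 100
mem[5] = 100     # loader: a CLA 00 template
mem[6] = 42      # ptr
mem[42] = 123    # the value we want to reach through ptr
print(indirect_load(mem, 5, 6, 23))   # prints 123
print(mem[23])                        # prints 142
```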

### Indirect Stores

Storing the accumulator to a memory location identified by a pointer is similar. We just have to be careful not to lose the value we want to store while we’re fiddling about with the store instruction, as in the following bit of code:

```
05	600	storer	DATA	600
06	042	ptr	DATA	042
07	000	acc	DATA	000

20	607		STO	acc
21	105		CLA	storer
22	206		ADD	ptr
23	625		STO	indstor
24	107		CLA	acc
25	600	indstor	STO	00
```

### Array Indexing

Often we aren’t so much interested in a pointer that identifies a single memory location as we are in an array of memory locations we can refer to by index. We will identify our array locations starting at index 0. So the first element of the array is at index 0, the second at index 1, and so on. If we have a variable called base that holds the first address of the array, then we can just add the base and the index together to get the address of a particular element. This is just a slight modification of the indirect accesses above. In particular, to load from an array element:

```
05	100	loader	DATA	100
06	042	base	DATA	042
07	000	index	DATA	000

20	105		CLA	loader
21	206		ADD	base
22	207		ADD	index
23	624		STO	arrload
24	100	arrload	CLA	00
```

and for storing to an array element:

```
05	600	storer	DATA	600
06	042	base	DATA	042
07	000	index	DATA	000
08	000	acc	DATA	000

20	608		STO	acc
21	105		CLA	storer
22	206		ADD	base
23	207		ADD	index
24	626		STO	arrstor
25	108		CLA	acc
26	600	arrstor	STO	00
```

If we’re dealing with only one array, we could eliminate one add instruction from each sequence by pre-adding the base and loader and pre-adding the base and storer.

## Stacks

Another use of indirect addressing is the stack data structure. If you’re not familiar with a stack, think of it like a stack of plates in a cafeteria. A plate is always placed on top of the stack. Likewise, the one removed is always the one on the top of the stack. We refer to the process of putting an element onto a stack as pushing and the process of taking an element off of a stack as popping. Note that we always pop the most recently pushed element. Because of this, the stack is often referred to as a last-in, first-out (LIFO) data structure. Pushing and popping are very similar to storing and loading indirectly, except that we must also adjust the value of the pointer that identifies the top of the stack. In the following code we’ll use a memory location named tos (for top-of-stack) for the pointer. Also, we’ll do as is often done in hardware stacks and let the stack grow downward. That is to say, as we push data onto the stack, the stack pointer moves toward lower memory addresses. With that in mind, here is a fragment of code for pushing the accumulator onto the stack:

```
05	600	storer	DATA	600
06	100	loader	DATA	100
07	089	tos	DATA	089
08	000	acc	DATA	000

20	608		STO	acc
21	107		CLA	tos
22	205		ADD	storer
23	628		STO	stapsh
24	107		CLA	tos
25	700		SUB	00
26	607		STO	tos
27	108		CLA	acc
28	600	stapsh	STO	00
```

And similarly to pop from the top of the stack:

```
20	107		CLA	tos
21	200		ADD	00
22	607		STO	tos
23	206		ADD	loader
24	625		STO	stapop
25	100	stapop	CLA	00
```

These code fragments (slightly modified) are used in the example below that uses the LIFO properties of the stack to reverse the order of a list of numbers on the input cards.
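For readers who prefer a higher-level view, the push and pop fragments can be modeled as follows. This is only a sketch: the class name `Stack` is hypothetical, the starting value 89 for `tos` is taken from the DATA line above, and the order of operations (store then decrement on push, increment then load on pop) mirrors the order in which the fragments update tos.

```python
class Stack:
    """Downward-growing stack inside a 100-cell memory list."""

    def __init__(self, mem, top=89):
        self.mem = mem
        self.tos = top               # address of the next free cell

    def push(self, value):
        self.mem[self.tos] = value   # store at the current top of stack
        self.tos -= 1                # then move toward lower addresses

    def pop(self):
        self.tos += 1                # step back to the newest element
        return self.mem[self.tos]    # then load it


mem = [0] * 100
stack = Stack(mem)
for card in (1, 2, 3):
    stack.push(card)
print([stack.pop() for _ in range(3)])   # prints [3, 2, 1]
```

Popping returns the cards in reverse order, which is the LIFO property the list-reversal example relies on.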

## Subroutines

There are many reasons why we might wish to subdivide a program into a number of smaller parts. In the context of higher level languages and methodologies, these subdivisions are often referred to by names like procedures, functions, and methods. All of these are types of subroutines, the name we usually use when working at the hardware or machine language level. In these sections, we look at the techniques for creating and using subroutines on the CARDIAC. Each subsection progressively builds from the simplest subroutine technique to more complex and advanced techniques. Don’t worry if not all of it makes sense on a first reading. You can get a good sense of the general idea of subroutines without necessarily understanding the details of how recursion is implemented on a machine as limited as the CARDIAC.

### Single Simple Subroutines

In the CARDIAC, the JMP instruction is effectively a jump-to-subroutine instruction, storing the return address in location 99. Because the address stored in location 99 is prefixed by the opcode 8, the instruction in that location becomes a return-from-subroutine instruction. Thus any segment of code whose last instruction is at location 99 can be called as a subroutine, simply by jumping to its first instruction. For example, a simple routine to double the value of the accumulator could be coded as:

```
96	000	accval	DATA	000

97	696	double	STO	accval
98	296		ADD	accval
99	800		JMP	00
```

It can then be called with:

```
	897		JMP	double
```

### Multiple Subroutines

Clearly, if our subroutine executes a JMP instruction or if it calls another subroutine, then we will lose our return address, because it will be overwritten by the JMP instruction. Along similar lines, if we have more than one subroutine in our program, only one of them can be at the end of the memory space and flow directly into location 99.

As a result, in many cases, we’ll need a more involved subroutine linkage mechanism. One way to accomplish this is to save the return address somewhere and restore it when needed. If we use this method, we’ll have to devise a mechanism to transfer control to location 99 with the right return address. Although location 99 can itself be used as the return from subroutine instruction, it doesn’t have to be. In many cases, it will be easier to copy it to the end of our actual subroutine. Using this approach, we can write a subroutine that outputs the value of the accumulator as follows:

```
80	686	aprint	STO	86
81	199		CLA	99
82	685		STO	aexit
83	586		OUT	86
84	186		CLA	86
85	800	aexit	JMP	00
```

Similarly, our doubling routine would look like:

```
90	696	double	STO	96
91	199		CLA	99
92	695		STO	dexit
93	196		CLA	96
94	296		ADD	96
95	800	dexit	JMP	00
```

See below for an example of a program that uses these subroutines to produce a list of powers of two.
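The reason for copying location 99 into the routine's own exit instruction can be shown with a small model. The helper names `call` and `save_return` are hypothetical, and the call sites at locations 30 and 49 are made up for the demonstration; only the addresses 80, 85, 90, and 99 come from the routines above.

```python
def call(mem, pc_after_jmp, target):
    """Model JMP: record the return jump in location 99, return the new PC."""
    mem[99] = 800 + pc_after_jmp
    return target


def save_return(mem, exit_addr):
    """Model CLA 99 / STO exit: copy the return jump into the exit slot."""
    mem[exit_addr] = mem[99]


mem = [0] * 100
pc = call(mem, 31, 80)     # a JMP aprint at location 30 (hypothetical call site)
save_return(mem, 85)       # aprint copies location 99 into aexit
pc = call(mem, 50, 90)     # a later JMP double overwrites location 99
print(mem[99])             # prints 850
print(mem[85])             # prints 831, so aprint can still return to 31
```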

### Recursion

There’s one more limitation on subroutines still in the techniques we have developed. What happens if a subroutine calls itself? You might reasonably ask, is it even useful for a function to call itself? The answer is yes, and it is called recursion.

The key to making it possible for a subroutine to call itself is to realize that no matter where we’re called from, we always want to return to the place from which we were most recently called that we haven’t already returned to. That should sound familiar. We should use the return addresses in the same LIFO order that a stack provides. In other words, when we call a recursive subroutine, we want to push the return address onto a stack and then pop it back off when we return from that subroutine. With a little reflection, we can see that this approach applies to all subroutine calls, not just to those that are recursive. This is why pushing return addresses on a stack is the basis for hardware subroutine call support in most architectures since about the 1970s on.

On the CARDIAC, we can implement this technique with a modification of the multiple subroutine technique above. When entering a subroutine, rather than copying location 99 to the return from subroutine instruction, we push the contents of location 99 onto the stack. Then when we’re about to return from the subroutine, we pop the return address off the stack into the return from subroutine instruction. So our code would look something like:

```
	1xx		CLA	tos
	2yy		ADD	storer
	6zz		STO	stapsh
	1xx		CLA	tos
	700		SUB	00
	6xx		STO	tos
	199		CLA	99
zz	600	stapsh	STO	00
	.
	.		body of the subroutine
	.
	1xx		CLA	tos
	200		ADD	00
	6xx		STO	tos
	2ww		ADD	loader
	6ss		STO	stapop
ss	100	stapop	CLA	00
	6rr		STO	rts
rr	800	rts	JMP	00
```

There’s one more aspect of recursive subroutines that is also suitable for other subroutines as well. In particular, subroutines often need input data passed to them by whatever code has called them or temporary variables that are needed during the course of their operation. If a subroutine is not recursive, we can get away with just allocating some fixed memory locations for these. However, in the case of recursive subroutines, we need to make sure that we have fresh ones for each time the subroutine is called and not overwrite the ones that might still be needed by other instances we might return back to. The most natural way to handle this is to allocate them on the stack along with the return address.

Putting all these things together, we can summarize the steps for calling a subroutine in the most general cases:

1. Before calling the subroutine, we push any inputs (also called arguments or parameters) onto the stack.
2. Transfer control to the first instruction of the subroutine, saving the PC (which holds the return address) in the process.
3. If the hardware has not already saved the PC onto the stack, the first thing we do in the subroutine is copy it to the stack.
4. Move the stack pointer to reserve space on the stack for any temporary (local) variables the subroutine will need.
5. Before returning, the subroutine readjusts the stack pointer to remove the temporary variables it allocated.
6. If the hardware does not already expect the return address to be on the stack, we need to pop it off the stack and copy it back to where it does need to be.
7. Return control from the subroutine back to the code that called it.
8. Finally, the calling code adjusts the stack pointer to remove the arguments it pushed onto the stack before calling the subroutine.

## Bootstrapping

Like many of the early system designs, the mechanism for loading an initial program into the CARDIAC and getting it running involves a small amount of hardware support and a lot of cleverness. The whole enterprise is somewhat reminiscent of the image of a person attempting to lift themselves off the ground by pulling on their own bootstraps. This is why we usually refer to the process as bootstrapping or often just booting.

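The loader relies on location 00 being fixed at 001, which is the instruction INP 01. Every deck begins with the two cards 002 and 800. The first is read into location 01, where it is executed as INP 02 and reads the second into location 02; that card, 800, is JMP 00 and closes a three-instruction loop. From then on the loop consumes cards in pairs: an address card 0aa is read into location 01 and then executed as INP aa, which reads the data card that follows it into location aa.

If this is all we did, we’d read all the remaining cards into memory and then the computer would halt when there were no more cards to read. But there’s another trick we can play. If we make the last address-data card pair change location 02 from 800 to a jump to the first instruction of our program, the loader loop will stop and control will transfer to the program we just loaded. So after all of our address-data card pairs, we’ll append the cards 002 and 8xx where xx is the address of the first instruction of our program. The net effect is that we can now load a program and start running it without any manual intervention.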

The last piece of this puzzle is how to include the data we want the program to operate on. It turns out that’s as simple as appending the data after the 002 and 8xx cards. When control transfers to the program we loaded, any remaining cards will still be in the reader waiting to be read. When the program executes its first INP instruction, it will happily read the next card, not knowing that there were a bunch of other cards read ahead of it.

So putting all the pieces together, we bootstrap the CARDIAC by putting together a card deck that looks like:

```
002
800
.
.
002
8xx	where xx is address of the first instruction
.
.	data cards
.
```

Then we put that deck into the card reader, and start the computer at address 00. The CARDIAC will first load the two-card bootstrap loader, then load the program into memory, then transfer control to the newly loaded program. If the program itself also includes INP instructions, they read the remaining data cards.
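To make the loading sequence concrete, here is a minimal Python sketch of a CARDIAC-like machine that starts at address 00 and runs a deck through the bootstrap loader. It is a simplified model written for this explanation, not the simulator described on this page; `run_cardiac` and `truncate` are hypothetical names, and details such as overflow handling are assumptions. The deck is built from the counting program whose card deck is listed in the Examples section.

```python
def truncate(value, digits):
    """Keep the low-order decimal digits of value, preserving its sign."""
    sign = -1 if value < 0 else 1
    return sign * (abs(value) % 10 ** digits)

def run_cardiac(deck, max_steps=100_000):
    """Run a card deck from a cold start and return the list of output cards."""
    mem = [0] * 100
    mem[0] = 1       # location 00 is fixed at 001, i.e. INP 01
    mem[99] = 800    # location 99 holds the return jump
    cards, output = list(deck), []
    acc, pc = 0, 0
    for _ in range(max_steps):
        opcode, address = divmod(abs(mem[pc]), 100)
        pc = (pc + 1) % 100
        if opcode == 0:                      # INP: read a card into memory
            if not cards:
                break                        # reader is empty, so the machine stops
            card = cards.pop(0)
            if address != 0:                 # location 00 cannot be overwritten
                mem[address] = card
        elif opcode == 1:                    # CLA: load the accumulator
            acc = mem[address]
        elif opcode == 2:                    # ADD
            acc = truncate(acc + mem[address], 4)
        elif opcode == 3:                    # TAC: jump if accumulator is negative
            if acc < 0:
                pc = address
        elif opcode == 4:                    # SFT: shift left x digits, then right y
            left, right = divmod(address, 10)
            acc = truncate(acc * 10 ** left, 4)
            acc = (-1 if acc < 0 else 1) * (abs(acc) // 10 ** right)
        elif opcode == 5:                    # OUT: write a card
            output.append(mem[address])
        elif opcode == 6:                    # STO: store the low three digits
            if address != 0:
                mem[address] = truncate(acc, 3)
        elif opcode == 7:                    # SUB
            acc = truncate(acc - mem[address], 4)
        elif opcode == 8:                    # JMP: save the return address in 99
            mem[99] = 800 + pc
            pc = address
        else:                                # HRS: halt
            break
    return output

# The counting program, written as address: word pairs.
program = {4: 9, 10: 100, 11: 605, 12: 104, 13: 322, 14: 505, 15: 105,
           16: 200, 17: 605, 18: 104, 19: 700, 20: 604, 21: 812, 22: 900}
deck = [2, 800]                    # the two-card bootstrap loader
for address, word in program.items():
    deck += [address, word]        # address card, then data card
deck += [2, 810]                   # patch location 02 to jump to the program
print(run_cardiac(deck))           # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Any cards placed after the final 002 and 810 pair would stay in the reader until the loaded program executes its own INP instructions.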

## Simulator

We have developed a CARDIAC simulator suitable for running the code discussed on this page. All of the examples in the next section have been tested using this simulator.

To avoid any unnecessary requirements on screen layout, the simulator is laid out a little differently than the physical CARDIAC. At the top of the screen is the CARDIAC logo from a photograph of the actual unit. This picture is also a link back to this page. The next section of the screen is the CARDIAC memory space as it appears on the right-hand side of the physical device. When the simulator starts up, the value 001 in location 00 and the value 8– in location 99 are preloaded. As a simplification, we don’t use a picture of a ladybug for the program counter, but instead highlight the memory location to which the PC points with a light green background. Each memory location is editable (including the ones that are intended to be fixed), and the tab key moves focus down each column in memory address order.

The bottom section of the simulator is the I/O and CPU. Input is divided into two text areas. The first is the card deck and is editable. The second area is the card reader, and as cards are consumed by the reader they are removed from the listing in the reader. Cards in the deck are loaded into the reader with the Load button. Output cards appear in the Output text area as they are generated with the OUT instruction.

The CPU section of the simulator has four parts showing the status of the CPU and buttons for control. At the top of the CPU section, the Program Counter is shown in an editable text box. Below that is the instruction decoder with non-editable text boxes showing the contents of the Instruction Register and a breakdown of the instruction decoding in the form of an opcode mnemonic and numeric operand. The Accumulator is shown below the instruction decoder. Below the register display are six buttons that control the operation of the simulator:

• Reset: Clears the instruction register, resets the PC and accumulator to 0, and clears the output card deck.
• Clear Mem: Resets all memory locations to blank and re-initializes location 00 to 001 and location 99 to 8–.
• Step: Executes the single instruction highlighted in the memory space, as pointed to by the program counter. Upon completion of the instruction, the screen is updated to show the state of the computer after the instruction.
• Slow: Begins executing code starting at the current PC. Instructions are executed at the rate of 10 per second, with the screen being updated after each instruction. When the program is run in this way, the movement of the highlighted memory shows the flow of control in the program very clearly.
• Run: In the current version of the simulator, executes the program beginning from the current PC at the full speed of the JavaScript interpreter. Because of the way JavaScript is typically implemented, the screen contents will not show the effects of code execution until the simulator executes the HRS instruction and the program halts.
• Halt: Pressing the Halt button while the program is running in slow mode causes the simulator to stop after the current instruction. The state of the machine remains intact, and execution can be continued with any of the Step, Slow, or Run buttons.

## Examples

The remainder of this page is a set of example programs written for the CARDIAC. They have all been tested using the simulator described above. Because the memory space of the CARDIAC is so limited, none of the programs are particularly complex. You won’t find a compiler, operating system, or web browser here. However, we do have a few that are more complex than you might expect. There’s a pretty simple program for generating a list of the powers of 2. There’s one that recursively solves the Towers of Hanoi problem. For each of them, we include the assembly language source code with assembled machine language code and a card deck suitable for bootstrapping on the CARDIAC.

Note that most of these examples aren’t the most compact way of solving the problem. Rather, they illustrate techniques described throughout this page. The primary exception is the Towers of Hanoi solution, which required some effort to squeeze into the limited memory space of the CARDIAC.

When we take these programs and turn them into decks of cards to be bootstrapped on the CARDIAC, we get the card decks listed below the program listings. If you cut and paste the list into the input deck of the simulator, hit load, and hit slow, you can see the program get loaded into memory and run.

### Count from 1 to 10

This is sort of our CARDIAC version of “Hello World.” Our objective is simply to print out a set of output cards with the values 1 to 10. We keep two variables to control the process. One, called n, keeps track of how many cards we still have left to print: at any point in time, we need to print n+1 more cards. We also have a variable called cntr, which is the number to print out. Each time through the loop, we check to see if n is negative and if so, we’re done. If not, we print cntr, increment cntr, and then decrement n.
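As a point of comparison, the same loop can be sketched in Python. The function name `count_cards` is hypothetical; the variables `n` and `cntr` mirror the two memory cells, and the loop test plays the role of the TAC instruction that exits once `n` goes negative.

```python
def count_cards(n=9, cntr=1):
    """Return the cards the counting program would print."""
    cards = []
    while n >= 0:            # TAC exit: leave the loop once n is negative
        cards.append(cntr)   # OUT cntr
        cntr += 1            # increment the number to print
        n -= 1               # one fewer card left to print
    return cards

print(count_cards())   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```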

#### Program Listing

```
04	009	n	DATA	009
05	000	cntr	DATA	000

10	100		CLA	00	Initialize the counter
11	605		STO	cntr
12	104	loop	CLA	n	If n < 0, exit
13	322		TAC	exit
14	505		OUT	cntr	Output a card
15	105		CLA	cntr	Increment the card
16	200		ADD	00
17	605		STO	cntr
18	104		CLA	n	Decrement n
19	700		SUB	00
20	604		STO	n
21	812		JMP	loop
22	900	exit	HRS	00
```

```
002
800
010
100
011
605
012
104
013
322
014
505
015
105
016
200
017
605
018
104
019
700
020
604
021
812
022
900
004
009
002
810
```

### List Reversal

Our next example uses the stack techniques described above to take in a list of cards and output the same list in reverse order. The first card in the input deck (after the bootstrapping and the program code) is the count of how many cards we’re operating on. The remainder of the input deck are the cards to reverse. In the example card deck, we are reversing the first seven Fibonacci numbers.
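The read and write loops can be sketched in Python with an explicit memory array and stack pointer. This is an illustrative model only: `reverse_deck` is a hypothetical name, and the stack is assumed to start at location 89 and grow downward, matching the `tos` value in the listing.

```python
def reverse_deck(cards):
    """cards[0] is the count; the next cards[0] entries are the values to reverse."""
    reader = iter(cards)
    count = next(reader)
    mem = [0] * 100
    tos = 89                    # first free stack cell
    for _ in range(count):      # read loop: push each card
        mem[tos] = next(reader)
        tos -= 1
    output = []
    for _ in range(count):      # write loop: pop each card
        tos += 1
        output.append(mem[tos])
    return output

print(reverse_deck([7, 1, 1, 2, 3, 5, 8, 13]))   # [13, 8, 5, 3, 2, 1, 1]
```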

#### Program Listing

```
04	600	storer	DATA	600
05	100		DATA	100	CLA 00 template used when popping
06	089	tos	DATA	089	Stack pointer
07	000	acc	DATA	000	Temp for saving accumulator
08	000	n1	DATA	000	Write counter
09	000	n2	DATA	000	Read counter

10	008		INP	n1	Get the number of cards to reverse
11	108		CLA	n1	Initialize a counter
12	609		STO	n2
13	109	rdlp	CLA	n2	Check to see if there are any more cards to read
14	700		SUB	00
15	327		TAC	wrlp
16	609		STO	n2
17	007		INP	acc	Read a card
18	106		CLA	tos	Push it onto the stack
19	204		ADD	storer
20	625		STO	stapsh
21	106		CLA	tos
22	700		SUB	00
23	606		STO	tos
24	107		CLA	acc
25	600	stapsh	STO	00
26	813		JMP	rdlp
27	108	wrlp	CLA	n1	Check to see if there are any more cards to write
28	700		SUB	00
29	339		TAC	done
30	608		STO	n1
31	106		CLA	tos	Pop a card off the stack
32	200		ADD	00
33	606		STO	tos
34	205		ADD	05	Add the CLA 00 template
35	636		STO	stapop
36	100	stapop	CLA	00
37	890		JMP	aprint	Output a card
38	827		JMP	wrlp
39	900	done	HRS	00

90	696	aprint	STO	96	Write a card containing the contents of the accumulator
91	199		CLA	99
92	695		STO	aexit
93	596		OUT	96
94	196		CLA	96
95	800	aexit	JMP	00
```

```
002
800
004
600
005
100
006
089
007
000
008
000
009
000
010
008
011
108
012
609
013
109
014
700
015
327
016
609
017
007
018
106
019
204
020
625
021
106
022
700
023
606
024
107
025
600
026
813
027
108
028
700
029
339
030
608
031
106
032
200
033
606
034
205
035
636
036
100
037
890
038
827
039
900
090
696
091
199
092
695
093
596
094
196
095
800
002
810
007
001
001
002
003
005
008
013
```

### Powers of 2

This is a slightly more interesting version of the list from 1 to 10. In this case, we print the powers of 2 from 2^0 to 2^9. The main difference is that instead of incrementing the number to output, we call a subroutine that doubles it. The program illustrates the use of multiple subroutines as discussed above.
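A rough Python equivalent of the control flow is shown here. The CARDIAC has no multiply instruction, so the doubling subroutine adds the value to itself; `double` and `powers_of_two` are illustrative names that only loosely mirror the labels in the listing.

```python
def double(value):
    """Double by addition, as the subroutine does with CLA 96 / ADD 96."""
    return value + value

def powers_of_two(cntr=9):
    """Return the cards printed: 2^0 first, then one doubling per loop pass."""
    n = 1
    cards = [n]               # the first card is printed before the loop
    while True:
        cntr -= 1
        if cntr < 0:          # TAC exit
            break
        n = double(n)
        cards.append(n)
    return cards

print(powers_of_two())   # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
```

The largest value, 512, still fits in a three-digit memory cell, which is why the count stops at nine doublings.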

#### Program Listing

```
04	000	n	DATA	000
05	009	cntr	DATA	009

10	100		CLA	00	Initialize the power variable with 2^0
11	880		JMP	aprint
12	604	loop	STO	n
13	105		CLA	cntr	Decrement the counter
14	700		SUB	00
15	321		TAC	exit	Are we done yet?
16	605		STO	cntr
17	104		CLA	n
18	890		JMP	double	Double the power variable
19	880		JMP	aprint	Print it
20	812		JMP	loop
21	900	exit	HRS	00

80	686	aprint	STO	86	Print a card with the contents of the accumulator
81	199		CLA	99
82	685		STO	aexit
83	586		OUT	86
84	186		CLA	86
85	800	aexit	JMP	00

90	696	double	STO	96	Double the contents of the accumulator
91	199		CLA	99
92	695		STO	dexit
93	196		CLA	96
94	296		ADD	96
95	800	dexit	JMP	00
```

```
002
800
005
009
010
100
011
880
012
604
013
105
014
700
015
321
016
605
017
104
018
890
019
880
020
812
021
900
080
686
081
199
082
685
083
586
084
186
090
696
091
199
092
695
093
196
094
296
002
810
```

### Towers of Hanoi

By far the most complex example we include is a solution to the Towers of Hanoi problem. The puzzle consists of three posts on which disks can be placed. We begin with a tower of disks on one post with each disk smaller than the one below it. The other two posts are empty. The objective is to move all of the disks from one post to another subject to the following rules:

1. Only one disk at a time may be moved.
2. No disk may be placed on top of a smaller disk.

According to legend, there is a set of 64 disks which a group of monks are responsible for moving from one post to another. When the puzzle with 64 disks is finally solved, the world will end.

Although the puzzle sounds like it would be difficult to solve, it’s very easy if we think recursively. Moving n disks from Post a to Post b using Post c as a spare can be done as follows:

1. Move n−1 disks from Post a to Post c.
2. Move one disk from Post a to Post b.
3. Move n−1 disks from Post c to Post b.

The CARDIAC doesn’t have enough memory to solve a 64-disk puzzle, but we can solve smaller instances of the problem. In particular, the program we show here can solve up to six disks. The actual number of disks to solve is given by the first data card, and the initial assignment of source, destination, and spare posts is given on the second data card. The post assignments as well as the output encoding are shown in the following table.

| Card value | Post a (source) | Post b (destination) | Post c (spare) | Disk move |
|---|---|---|---|---|
| 000 | 1 | 3 | 2 | 1 → 3 |
| 001 | 2 | 3 | 1 | 2 → 3 |
| 002 | 3 | 2 | 1 | 3 → 2 |
| 003 | 3 | 1 | 2 | 3 → 1 |
| 004 | 2 | 1 | 3 | 2 → 1 |
| 005 | 1 | 2 | 3 | 1 → 2 |

For example, the post assignments indicated by a card with the value 3 are that Post 3 is a, Post 2 is c and Post 1 is b. Similarly, an output card with 3 indicates that we are to move a disk from Post 3 to Post 1.
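The recursion and the card encoding can be sketched together in Python. In this sketch, `MOVES` restates the table above, and `SECOND` copies the six data words stored at locations 07 to 12 in the program listing, which map an ordering to the ordering used for the second recursive call; the first recursive call uses 5 minus the current ordering, as the listing does with its `five` constant. The names `MOVES`, `SECOND`, and `tower` are chosen here for illustration.

```python
MOVES = {0: (1, 3), 1: (2, 3), 2: (3, 2), 3: (3, 1), 4: (2, 1), 5: (1, 2)}
SECOND = [1, 0, 5, 4, 3, 2]   # ordering for "move n-1 disks from c to b"

def tower(n, order, cards):
    """Append the output cards for moving n disks under the given ordering."""
    if n == 0:
        return
    tower(n - 1, 5 - order, cards)       # move n-1 disks from a to c
    cards.append(order)                  # move one disk from a to b
    tower(n - 1, SECOND[order], cards)   # move n-1 disks from c to b

cards = []
tower(3, 0, cards)
print(cards)                        # [0, 5, 2, 0, 4, 1, 0]
print([MOVES[c] for c in cards])    # the same moves as (from, to) post pairs
```

A puzzle with n disks produces 2^n − 1 cards, so the six-disk limit corresponds to 63 output cards.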

Before trying to understand the details of this program, note that it uses several tricks to reduce memory usage. The memory available for the stack allows a puzzle of up to six disks to be solved. Be aware, however, that running this program on six disks with the Slow button takes the better part of half an hour.

#### Program Listing

```
03	031	tos	DATA	031
04	100		DATA	100	CLA 00 template
05	600	storer	DATA	600
06	107	r2ld	DATA	r2
07	001	r2	DATA	001
08	000		DATA	000
09	005	five	DATA	005
10	004		DATA	004
11	003	three	DATA	003
12	002		DATA	002

34	033		INP	33	Get the number of disks from the cards
35	032		INP	32	Get the column ordering from the cards
36	838		JMP	tower	Call the tower solver
37	900		HRS

38	199	tower	CLA	99	Push the return address on the stack
39	890		JMP	push
40	111		CLA	three	Fetch n from the stack
41	870		JMP	stkref
42	700		SUB	00	Check for n=0
43	366		TAC	towdone
44	890		JMP	push	Push n-1 for a recursive call
45	111		CLA	three	Get the first recursive order
46	870		JMP	stkref
47	669		STO	t1
48	109		CLA	five
49	769		SUB	t1
50	890		JMP	push
51	838		JMP	tower	Make first recursive call
52	880		JMP	pop
53	111		CLA	three	Get move to output
54	870		JMP	stkref
55	669		STO	t1
56	569		OUT	t1
57	111		CLA	three	Get second recursive order
58	870		JMP	stkref
59	206		ADD	r2ld
60	661		STO	t2
61	100	t2	CLA	00
62	890		JMP	push
63	838		JMP	tower	Make second recursive call
64	880		JMP	pop
65	880		JMP	pop
66	880	towdone	JMP	pop
67	668		STO	towret
68	800	towret	JMP	00

70	679	stkref	STO	refsav	Replace the accumulator with the contents
71	199		CLA	99	of the stack indexed by the accumulator
72	678		STO	refret
73	179		CLA	refsav
74	203		ADD	tos
75	204		ADD	04	Add the CLA 00 template
76	677		STO	ref
77	100	ref	CLA	00
78	800	refret	JMP	00

80	199	pop	CLA	99	Pop the stack into the accumulator
81	688		STO	popret
82	103		CLA	tos
83	200		ADD	00
84	603		STO	tos
85	204		ADD	04	Add the CLA 00 template
86	687		STO	popa
87	100	popa	CLA	00
88	800	popret	JMP	00

90	689	push	STO	pshsav	Push the accumulator on to the stack
91	103		CLA	tos
92	205		ADD	storer
93	698		STO	psha
94	103		CLA	tos
95	700		SUB	00
96	603		STO	tos
97	189		CLA	pshsav
98	600	psha	STO	00
```

```
002
800
003
031
004
100
005
600
006
107
007
001
008
000
009
005
010
004
011
003
012
002
034
033
035
032
036
838
037
900
038
199
039
890
040
111
041
870
042
700
043
366
044
890
045
111
046
870
047
669
048
109
049
769
050
890
051
838
052
880
053
111
054
870
055
669
056
569
057
111
058
870
059
206
060
661
061
100
062
890
063
838
064
880
065
880
066
880
067
668
068
800
070
679
071
199
072
678
073
179
074
203
075
204
076
677
077
100
078
800
080
199
081
688
082
103
083
200
084
603
085
204
086
687
087
100
088
800
090
689
091
103
092
205
093
698
094
103
095
700
096
603
097
189
098
600
002
834
003
000
```

### Pythagorean Triples

The next example comes courtesy of Mark and Will Tapley. It finds sets of three integers which satisfy the Pythagorean property $x^2 + y^2 = z^2$.

#### Discussion

There is much motivation and explanation for this program at:

##### Subroutine to calculate square of a number:

In finding Pythagorean triplets, the operation of squaring a number occurs very often, so the program uses a subroutine to perform this function.

• Addresses 076–099 are loaded with the subroutine to utilize the return function hard-wired at address 099.
• Addresses 072–075 are used for data storage for the subroutine.
• Address 072 is loaded with the value 32, one larger than the largest allowable input. The calling program can test an input by subtracting this value from the prospective input and branching if the result is negative. (Negative value means legal input.)
• Address 073 accepts the input to the subroutine. On return, the absolute value of the input will be in this location.
• Address 074 is used as a counter during routine execution.
• Address 075 will contain the calculated square, an integer between 0 and 961 inclusive.

##### Subroutine INPUT:

Store the number to be squared in address 073

-OR-

Load the number to be squared into the accumulator

##### Subroutine OUTPUT:

On return, the square of the input number is in address 075.

The subroutine has a single loop (addresses 090–098). On each pass, it subtracts one from a counter that is initially set to one greater than the input number $N$, then adds a copy of $N$ to the output address. When the counter reaches 1, the output address contains the sum of $N$ copies of $N$, which is $N^2$, and the loop exits, returning program control to the location from which it was called (using the return capability of location 99).
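The loop can be mirrored in Python as a sketch. The function name `square` and the use of an exception to stand in for the HRS halt are choices made here for illustration; the counter handling follows the description above, with the limit of 32 taken from address 072.

```python
def square(n, limit=32):
    """Square n by repeated addition, as the subroutine's loop does."""
    n = abs(n)                     # the subroutine works on the absolute value
    if n - limit >= 0:
        raise ValueError("input too large: the CARDIAC routine halts here")
    total = 0                      # SQOUT starts at 0
    counter = n + 1                # SQCNT starts one greater than the input
    while True:
        counter -= 1
        total += n                 # add one more copy of n
        if 1 - counter >= 0:       # TAC loops again only while 1 - counter < 0
            return total

print(square(31), square(-4), square(0))   # 961 16 0
```

An input of 0 still makes one pass through the loop, but that pass adds 0, so the result is correct.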

##### Limitations:

The square of the input number must have 3 or fewer digits to comply with cell storage limitations. Therefore the input number is checked to be 31 or less (since $32^2 = 1024$). Violating this condition will cause the subroutine to terminate execution (HRS) with the program counter pointing at location 086. The input number is converted from negative to positive if it was negative, so if the calling program needs a copy of the input, it should store it in some location other than Address 073 (SQIN). After the subroutine executes, that location will contain the absolute value of the input.

##### Main Program:

The main program searches over all allowable lengths of the shortest side $S$ of the right triangles corresponding to Pythagorean triplets. For each shortest side, it then searches over all possible lengths of the intermediate side $L$. For each combination of short and intermediate sides, it checks whether there is a hypotenuse $H$ that satisfies the condition $S^2 + L^2 = H^2$. The short side ($S$) search starts at 0, to avoid missing any triplets with very small values. (This results in identifying the degenerate triplet (0,1,1), which does satisfy $0^2 + 1^2 = 1^2$ but does not really correspond to a right triangle.) The long side ($L$) search for each value of $S$ starts at $S+1$, because $L$ cannot equal $S$ for an integer triplet (see URL above) and if $L < S$, the corresponding triplet should already have been found with a smaller $S$. (So this program will identify (3,4,5) but will not identify (4,3,5).) The hypotenuse ($H$) search starts at no more than 1.4 times $S$, since the minimum possible length of the hypotenuse is greater than $\sqrt{2}$ (1.414…) times $S$. (Note: 1.4 times $S$ is approximated by shifting $S$ right by one digit and then adding four copies of the result, which is truncated to an integer, to $S$. For $S < 10$, the result is just $S$, so the search takes needlessly long until $S \ge 10$.)

With the starting values for $S$, $L$, and $H$, the program calculates $S^2 + L^2 - H^2$. If the result is <0, $H$ is too long. In this case, the program increments $L$ and tries again. If the result is =0, a triplet has been found and is printed out. The program then increments $L$ and tries again. If the result is >0, $H$ is too short. In this case, $H$ is incremented and the program tries again. When $H$ is long enough that no more triplets can be found for this value of $S$, the value of $S$ is incremented, new $L$ and $H$ starting values are calculated, and the loop repeats.

• Addresses 004–009 are used for data storage.
• Address 004 contains S, the smallest member of the triplet (length of the short leg of the triangle) and is initially set to 0.
• Address 005 contains $S^2$, calculated each time $S$ is changed.
• Address 006 contains $L$, the intermediate member of the triplet (length of the long “leg” of the triangle), and is re-initialized for each smallest member loop to one greater than the smallest member (which is always the minimum possible value for $L$; see above).
• Address 007 contains $L^2$, calculated each time $L$ is changed.
• Address 008 contains $H$, the largest member of the triplet (length of the hypotenuse of the triangle), and is initialized for each smallest member to a value no greater than 1.4×(the smallest value), which is always shorter than the minimum possible value for $H$.
• Address 009 contains $H^2$, calculated each time $H$ is changed. The same address also temporarily holds $S/10$ ($S$ shifted right by one place), used to initialize $H$ each time $S$ is changed. This value is used to set the initial value of $H$ to approximately $1.4S$, which is just less than $\sqrt{2}\,S$.

The “outside” loop of the program (addresses 010–067) tests for all possible sets of triplets with the smallest value S stored in 004. After each loop, it increments the value of S and tries again. This loop will terminate when the value of 1.4×S exceeds 31, since the subroutine will no longer be able to calculate correct squares for any possible hypotenuse value (H). The subroutine will halt execution when this input is sent to it. (The outer loop also contains a check to verify that the value of S itself doesn’t exceed 31, but this check is never reached.)

The next-inner loop (addresses 032–061) starts with a value of $L = S+1$. Any smaller, and $L$ would take the role of $S$ (and hence, the resulting triplet would have already been found with a smaller $S$) or would be equal to $S$ (and the length of the corresponding hypotenuse would be irrational). This loop terminates on one of two conditions: first, when the value of $H$ exceeds 31 (in which case the subroutine to calculate squares can no longer work); or second, when $2L > S^2$. This latter condition applies because once $L$ exceeds $S^2/2$, $L^2$ and $H^2$ cannot differ by as little as $S^2$ even if $H = L+1$. At that point, $H^2 - L^2 \ge (L+1)^2 - L^2 = 2L + 1 > S^2$.

The innermost section (addresses 032–044) calculates the difference $S^2 + L^2 - H^2$. If the difference is positive, $H$ is incremented and the loop repeats. If the difference is zero, a triplet has been found and the values of $S$, $L$, and $H$ are printed out. If the difference is negative or zero, $L$ is then incremented and the loop repeats. In any case where $H$ is incremented, its new value is checked against the limit for inputs to the subroutine, and if it exceeds that limit, the inner two loops terminate and the outer loop progresses to the next value of $S$.
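The three nested loops can be summarized in a Python sketch that follows the control flow described above. It is a reading of the listing, not a replacement for it: `find_triples`, `square`, and the `Halt` exception (standing in for the HRS inside the squaring subroutine) are names invented here, and squaring uses ordinary multiplication for brevity.

```python
class Halt(Exception):
    """Stands in for the HRS executed when the square subroutine rejects its input."""

def square(n, limit=32):
    if abs(n) >= limit:
        raise Halt
    return n * n

def find_triples(limit=32):
    found = []
    s = 0
    try:
        while s < limit:
            l = s + 1                         # long side starts just above S
            s2, l2 = square(s), square(l)
            h = s + 4 * (s // 10)             # roughly 1.4 * S, never above it
            h2 = square(h)                    # halts once this start value reaches 32
            while True:
                diff = s2 + l2 - h2
                if diff > 0:                  # H too short: try a longer hypotenuse
                    h += 1
                    if h >= limit:
                        break                 # move on to the next S
                    h2 = square(h)
                    continue
                if diff == 0:                 # H just right: record the triplet
                    found.append((s, l, h))
                l += 1                        # H too long or triplet found: next L
                l2 = square(l)
                if 2 * l - s2 >= 0:
                    break                     # no further triplets for this S
            s += 1
    except Halt:
        pass
    return found

print(find_triples()[:2])   # [(0, 1, 1), (3, 4, 5)]
```

Because both $L$ and $H$ only ever increase for a given $S$, each inner search is a single forward sweep rather than a full scan of every $(L, H)$ pair.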

##### Independent Verification:

The code below is a set of Mathematica instructions (tested on versions 8 and 3) which should compute the same output as the above program, but using a more general (and slower) algorithm. It will also generate a plot of triplets by (short side) against (intermediate side).

```
candid =
  Table[
    Table[
      Table[
        {i, j, k},
        {k, j, i^2/2 + 2}
      ],
      {j, i+1, i^2/2 + 1}
    ],
    {i, 0, 31}
  ];

trips = Select[Flatten[candid, 2], #1[[1]]^2 + #1[[2]]^2 == #1[[3]]^2 & ];

smalltrips = Select[trips, #1[[3]] < 32 & ]

ListPlot[(Take[#1, 2] & ) /@ trips]
```

#### Program Listing

##### Symbol map:
```
Address		Variable
04		S			short side = 0 initially
05		S2			square of short side
06		L			long side
07		L2			square of long side
08		H			hypotenuse
09		H2			square of hypotenuse. (Also used to store S/10 in picking initial value of H each loop.)
--		----
72		SQLIM			limit on input to Square = 32 (one more than the largest allowed input)
73		SQIN			input to square subroutine
74		SQCNT			counter for square subroutine
75		SQOUT			output for square subroutine

Address		Name (as referenced by JMP instruction)
00		BootLp
10		S_Loop
32		L_Loop
45		Next_H
49		PrintTr
52		Inc_L
62		Next_S
--		-----
76		SQacc
77		SQmem
83		SQpos
87		SQgood
90		SQloop

BootLp:
002	800		JMP 	BootLp	Bootstrap loop. Code self-modifies

004	000		(variable)	S Initial value for Short side = 0
072	032		(constant)	SQLIM Limit on input to square = 32

S_Loop:
010	104		CLA	S
011	673		STO	SQIN	Input to square subroutine
012	200		ADD	1	(Using ROM value)
013	606		STO	L	Save long side L
014	877		JMP	SQmem	Square subroutine (saved entry)
015	175		CLA	SQOUT	Retrieve result of subroutine
016	605		STO	S2	Store square of S
017	106		CLA	L	Load L
018	876		JMP	SQacc	Square subroutine, entry using ACC
019	175		CLA	SQOUT	Retrieve result of subroutine
020	607		STO	L2	Store square of L
021	104		CLA	S	Load S
022	401		SFT	01	Divide by 10
023	609		STO	H2	Save S/10 temporarily in H2 location
024	209		ADD	H2	Sum into accumulator
025	209		ADD	H2	Sum into accumulator
026	209		ADD	H2	Sum into accumulator
027	204		ADD	S	Sum is now between S and 1.4 S ~ S sqrt(2)
028	608		STO	H	Store initial hypotenuse H
029	876		JMP	SQacc	Square subroutine (accumulator entry)
030	175		CLA	SQOUT
031	609		STO	H2	Store square of H

L_Loop:
032	105		CLA	S2	Load short side squared
033	207		ADD	L2	Add long side squared
034	709		SUB	H2	Subtract hyp. squared
035	352		TAC	Inc_L	if H2 too big, increment L
036	700		SUB	1	Subtract 1 (ROM)
037	349		TAC	PrintTr	H was just right - print
038	108		CLA	H	H too small, so load H
039	200		ADD	1	Increment (ROM value)
040	608		STO	H	Store back
041	673		STO	SQIN	Save in input to Square routine
042	772		SUB	SQLIM	Subtract limit for input
043	345		TAC	Next_H	Go on if negative (input < 32)
044	862		JMP	Next_S	Branch to next value of S if not.

Next_H:
045	877		JMP	SQmem	(saved entry)
046	175		CLA	SQOUT	Get result
047	609		STO	H2
048	832		JMP	L_Loop	Try again

PrintTr:
049	504		OUT	S	Print S
050	506		OUT	L	Print L
051	508		OUT	H	Print H

Inc_L:
052	106		CLA	L	Load L
053	200		ADD	1	Increment (ROM value)
054	606		STO	L	Store
055	876		JMP	SQacc	Square subr.
056	175		CLA	SQOUT	get result
057	607		STO	L2	Store new L squared
058	106		CLA	L	Load new L
059	206		ADD	L	Double it
060	705		SUB	S2	Subtract S^2
061	332		TAC	L_Loop	If S^2 still bigger, keep looking

Next_S:
062	104		CLA	S	Load short side S
063	200		ADD	1	Increment (ROM value)
064	604		STO	S	Store short side S
065	772		SUB	SQLIM	Subtract upper limit for Square
066	310		TAC	S_Loop	If result is negative, new S is low enough to loop again
067	900		HRS		Else, S is longer than Square can handle, so Done - exit.

---	---		--------
SQacc:
076	673		STO	SQIN	Jump here if input value is in ACC

SQmem:
077	173		CLA	SQIN	Jump here if input is already in SQIN
078	773		SUB	SQIN	Input was in both accumulator and SQIN, so this gets 0
079	675		STO	SQOUT	initialize output to 0 for use later
080	773		SUB	SQIN	This gets negative of SQIN
081	383		TAC	SQpos	If the negative is negative, SQIN is positive - good.
082	673		STO	SQIN	If the negative is positive, store that in SQIN.

SQpos:
083	173		CLA	SQIN	Load Absolute value of input
084	772		SUB	SQLIM	Compare against limit value
085	387		TAC	SQgood	Continue if number to square < limit
086	986		HRS		Halt if error on input.

SQgood:
087	173		CLA	SQIN	Retrieve number
088	200		ADD	1	Add 1 (ROM value)
089	674		STO	SQCNT	Count is input + 1

SQloop:
090	174		CLA	SQCNT	load counter
091	700		SUB	0	subtract 1
092	674		STO	SQCNT	save new counter value
093	175		CLA	SQOUT	load output
094	273		ADD	SQIN	add a copy of the input
095	675		STO	SQOUT	store cumulative sum
096	100		CLA	0	load 1 (from ROM)
097	774		SUB	SQCNT	subtract counter
098	390		TAC	SQloop	loop again if counter was > 1

Jump out of boot loop to 10 (skips initial increment to S)
002	810		JMP	S_Loop
```

```
002
800
004
000
072
032
010
104
011
673
012
200
013
606
014
877
015
175
016
605
017
106
018
876
019
175
020
607
021
104
022
401
023
609
024
209
025
209
026
209
027
204
028
608
029
876
030
175
031
609
032
105
033
207
034
709
035
352
036
700
037
349
038
108
039
200
040
608
041
673
042
772
043
345
044
862
045
877
046
175
047
609
048
832
049
504
050
506
051
508
052
106
053
200
054
606
055
876
056
175
057
607
058
106
059
206
060
705
061
332
062
104
063
200
064
604
065
772
066
310
067
900
076
673
077
173
078
773
079
675
080
773
081
383
082
673
083
173
084
772
085
387
086
986
087
173
088
200
089
674
090
174
091
700
092
674
093
175
094
273
095
675
096
100
097
774
098
390
002
810
```

## https://www.tutorialspoint.com/execute_python_online.php

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

 Numerical Computation Guide

# What Every Computer Scientist Should Know About Floating-Point Arithmetic

Note – This appendix is an edited reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys. Copyright 1991, Association for Computing Machinery, Inc., reprinted by permission.

## Abstract

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on those aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point.

Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General – instruction set design; D.3.4 [Programming Languages]: Processors – compilers, optimization; G.1.0 [Numerical Analysis]: General – computer arithmetic, error analysis, numerical algorithms. (Secondary) D.2.1 [Software Engineering]: Requirements/Specifications – languages; D.3.1 [Programming Languages]: Formal Definitions and Theory – semantics; D.4.1 [Operating Systems]: Process Management – synchronization.

General Terms: Algorithms, Design, Languages

Additional Key Words and Phrases: Denormalized number, exception, floating-point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow.

## Introduction

Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first section, Rounding Error, discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in the section Rounding Error. The third part discusses the connections between floating-point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers and exception handling.

I have tried to avoid making statements about floating-point without also giving reasons why the statements are true, especially since the justifications involve nothing more complicated than elementary calculus. Those explanations that are not central to the main argument have been grouped into a section called “The Details,” so that they can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the ∎ symbol. When a proof is not included, the ∎ appears immediately following the statement of the theorem.

## Rounding Error

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. The section Relative Error and Ulps describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. The section Guard Digits discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.

The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Exactly Rounded Operations.

### Floating-point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation.¹ Floating-point representations have a base $\beta$ (which is always assumed to be even) and a precision $p$. If $\beta = 10$ and $p = 3$, then the number 0.1 is represented as $1.00 \times 10^{-1}$. If $\beta = 2$ and $p = 24$, then the decimal number 0.1 cannot be represented exactly, but is approximately $1.10011001100110011001101 \times 2^{-4}$.
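The binary approximation of 0.1 quoted above can be checked directly. The following is a minimal sketch, assuming the platform's 32-bit `float` encoding is IEEE single precision ($\beta = 2$, $p = 24$); the helper name `single_precision_parts` is hypothetical, and it handles only normalized values.

```python
import struct

def single_precision_parts(value):
    """Round value to single precision; return (significand string, exponent).

    Hypothetical helper for illustration; valid for normalized numbers only.
    """
    # Pack as a 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the exponent bias
    fraction = bits & 0x7FFFFF               # the 23 stored significand bits
    # Normalized numbers carry an implicit leading 1 in front of the fraction.
    return "1." + format(fraction, "023b"), exponent

significand, exponent = single_precision_parts(0.1)
print(f"{significand} x 2^{exponent}")
```

The printed significand has 24 bits in total, matching $p = 24$, and the exponent is the power of two that scales it.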

In general, a floating-point number will be represented as $\pm d.dd \ldots d \times \beta^e$, where $d.dd \ldots d$ is called the significand² and has $p$ digits. More precisely, $\pm d_0 . d_1 d_2 \ldots d_{p-1} \times \beta^e$ represents the number

$$\pm\left(d_0 + d_1\beta^{-1} + \cdots + d_{p-1}\beta^{-(p-1)}\right)\beta^e, \quad 0 \le d_i < \beta. \tag{1}$$

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion. Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, $e_{\max}$ and $e_{\min}$. Since there are $\beta^p$ possible significands, and $e_{\max} - e_{\min} + 1$ possible exponents, a floating-point number can be encoded in

$$\lceil \log_2(e_{\max} - e_{\min} + 1) \rceil + \lceil \log_2(\beta^p) \rceil + 1$$

bits, where the final +1 is for the sign bit. The precise encoding is not important for now.

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus when $\beta = 2$, the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. A less common situation is that a real number is out of range, that is, its absolute value is larger than $\beta \times \beta^{e_{\max}}$ or smaller than $1.0 \times \beta^{e_{\min}}$. Most of this paper discusses issues due to the first reason. However, numbers that are out of range will be discussed in the sections Infinity and Denormalized Numbers.

Floating-point representations are not necessarily unique. For example, both $0.01 \times 10^1$ and $1.00 \times 10^{-1}$ represent 0.1. If the leading digit is nonzero ($d_0 \ne 0$ in equation (1) above), then the representation is said to be normalized. The floating-point number $1.00 \times 10^{-1}$ is normalized, while $0.01 \times 10^1$ is not. When $\beta = 2$, $p = 3$, $e_{\min} = -1$ and $e_{\max} = 2$ there are 16 normalized floating-point numbers, as shown in FIGURE D-1. The bold hash marks correspond to numbers whose significand is 1.00. Requiring that a floating-point representation be normalized makes the representation unique. Unfortunately, this restriction makes it impossible to represent zero! A natural way to represent 0 is with $1.0 \times \beta^{e_{\min}-1}$, since this preserves the fact that the numerical ordering of nonnegative real numbers corresponds to the lexicographic ordering of their floating-point representations.³ When the exponent is stored in a $k$ bit field, that means that only $2^k - 1$ values are available for use as exponents, since one must be reserved to represent 0.

Note that the $\times$ in a floating-point number is part of the notation, and different from a floating-point multiply operation. The meaning of the $\times$ symbol should be clear from the context. For example, the expression $(2.5 \times 10^{-3}) \times (4.0 \times 10^2)$ involves only a single floating-point multiplication.

FIGURE D-1 Normalized numbers when $\beta = 2$, $p = 3$, $e_{\min} = -1$, $e_{\max} = 2$
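The count of 16 normalized numbers can be reproduced by enumeration. This is a small sketch using exact rational arithmetic; the function name `normalized_numbers` and its parameter names are illustrative choices, not notation from the paper.

```python
from fractions import Fraction

def normalized_numbers(beta, p, e_min, e_max):
    """All positive normalized numbers d0.d1...d(p-1) x beta^e, as exact fractions."""
    values = []
    for e in range(e_min, e_max + 1):
        # Integer significands beta^(p-1) .. beta^p - 1 are exactly those with d0 != 0.
        for n in range(beta ** (p - 1), beta ** p):
            values.append(Fraction(n, beta ** (p - 1)) * Fraction(beta) ** e)
    return sorted(values)

numbers = normalized_numbers(beta=2, p=3, e_min=-1, e_max=2)
print(len(numbers), [float(v) for v in numbers])
```

Each of the four exponents contributes four significands (1.00, 1.01, 1.10, 1.11 in binary), and the spacing between neighbours doubles with each exponent step.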

### Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error. Consider the floating-point format with $\beta = 10$ and $p = 3$, which will be used throughout this section. If the result of a floating-point computation is $3.12 \times 10^{-2}$, and the answer when computed to infinite precision is .0314, it is clear that this is in error by 2 units in the last place. Similarly, if the real number .0314159 is represented as $3.14 \times 10^{-2}$, then it is in error by .159 units in the last place. In general, if the floating-point number $d.d \ldots d \times \beta^e$ is used to represent $z$, then it is in error by $|d.d \ldots d - (z/\beta^e)|\beta^{p-1}$ units in the last place.⁴, ⁵ The term ulps will be used as shorthand for “units in the last place.” If the result of a calculation is the floating-point number nearest to the correct result, it still might be in error by as much as .5 ulp. Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number. For example, the relative error committed when approximating 3.14159 by $3.14 \times 10^0$ is $.00159/3.14159 \approx .0005$.

To compute the relative error that corresponds to .5 ulp, observe that when a real number is approximated by the closest possible floating-point number $d.dd \ldots dd \times \beta^e$, the error can be as large as $0.00 \ldots 00\beta' \times \beta^e$, where $\beta'$ is the digit $\beta/2$, there are $p$ units in the significand of the floating-point number, and $p$ units of 0 in the significand of the error. This error is $((\beta/2)\beta^{-p}) \times \beta^e$. Since numbers of the form $d.dd \ldots dd \times \beta^e$ all have the same absolute error, but have values that range between $\beta^e$ and $\beta \times \beta^e$, the relative error ranges between $((\beta/2)\beta^{-p}) \times \beta^e/\beta^e$ and $((\beta/2)\beta^{-p}) \times \beta^e/\beta^{e+1}$. That is,

$$\frac{1}{2}\beta^{-p} \le \frac{1}{2}\,\text{ulp} \le \frac{\beta}{2}\beta^{-p}. \tag{2}$$

In particular, the relative error corresponding to .5 ulp can vary by a factor of $\beta$. This factor is called the wobble. Setting $\epsilon = (\beta/2)\beta^{-p}$ to the largest of the bounds in (2) above, we can say that when a real number is rounded to the closest floating-point number, the relative error is always bounded by $\epsilon$, which is referred to as machine epsilon.

In the example above, the relative error was $.00159/3.14159 \approx .0005$. In order to avoid such small numbers, the relative error is normally written as a factor times $\epsilon$, which in this case is $\epsilon = (\beta/2)\beta^{-p} = 5(10)^{-3} = .005$. Thus the relative error would be expressed as $((.00159/3.14159)/.005)\epsilon \approx 0.1\epsilon$.

To illustrate the difference between ulps and relative error, consider the real number $x = 12.35$. It is approximated by $\tilde{x} = 1.24 \times 10^1$. The error is 0.5 ulps, the relative error is $0.8\epsilon$. Next consider the computation $8\tilde{x}$. The exact value is $8x = 98.8$, while the computed value is $8\tilde{x} = 9.92 \times 10^1$. The error is now 4.0 ulps, but the relative error is still $0.8\epsilon$. The error measured in ulps is 8 times larger, even though the relative error is the same. In general, when the base is $\beta$, a fixed relative error expressed in ulps can wobble by a factor of up to $\beta$. And conversely, as equation (2) above shows, a fixed error of .5 ulps results in a relative error that can wobble by $\beta$.
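The two measures in this example can be recomputed exactly. The sketch below assumes the $\beta = 10$, $p = 3$ format of this section and nonzero inputs; the helper names (`exponent_of`, `error_in_ulps`, `relative_error_in_eps`) are hypothetical.

```python
from fractions import Fraction as F

BETA, P = 10, 3
EPS = F(BETA, 2) * F(BETA) ** -P          # machine epsilon (beta/2) * beta^-p = .005

def exponent_of(v):
    """Return e with beta^e <= |v| < beta^(e+1); assumes v is nonzero."""
    v, e = abs(F(v)), 0
    while v >= BETA:
        v, e = v / BETA, e + 1
    while v < 1:
        v, e = v * BETA, e - 1
    return e

def error_in_ulps(approx, exact):
    """Error of the floating-point number approx, in units of its last place."""
    ulp = F(BETA) ** (exponent_of(approx) - (P - 1))
    return abs(F(approx) - F(exact)) / ulp

def relative_error_in_eps(approx, exact):
    """Relative error expressed as a multiple of machine epsilon."""
    return abs(F(approx) - F(exact)) / abs(F(exact)) / EPS

for approx, exact in [("12.4", "12.35"), ("99.2", "98.8")]:
    print(approx, float(error_in_ulps(approx, exact)),
          round(float(relative_error_in_eps(approx, exact)), 2))
```

Multiplying both the value and its approximation by 8 leaves the relative error unchanged but moves the pair to a part of the format where the same absolute gap spans more ulps.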

The most natural way to measure rounding error is in ulps. For example, rounding to the nearest floating-point number corresponds to an error of less than or equal to .5 ulp. However, when analyzing the rounding error caused by various formulas, relative error is a better measure. A good illustration of this is the analysis in the section Theorem 9. Since $\epsilon$ can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of $\beta$, error estimates of formulas will be tighter on machines with a small $\beta$.

When only the order of magnitude of rounding error is of interest, ulps and $\epsilon$ may be used interchangeably, since they differ by at most a factor of $\beta$. For example, when a floating-point number is in error by $n$ ulps, that means that the number of contaminated digits is $\log_\beta n$. If the relative error in a computation is $n\epsilon$, then

$$\text{contaminated digits} \approx \log_\beta n. \tag{3}$$

### Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly and then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Assuming $p = 3$, $2.15 \times 10^{12} - 1.25 \times 10^{-5}$ would be calculated as

$$\begin{aligned} x &= 2.15 \times 10^{12} \\ y &= .0000000000000000125 \times 10^{12} \\ x - y &= 2.1499999999999999875 \times 10^{12} \end{aligned}$$

which rounds to $2.15 \times 10^{12}$. Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Suppose that the number of digits kept is $p$, and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Then $2.15 \times 10^{12} - 1.25 \times 10^{-5}$ becomes

$$\begin{aligned} x &= 2.15 \times 10^{12} \\ y &= 0.00 \times 10^{12} \\ x - y &= 2.15 \times 10^{12} \end{aligned}$$

The answer is exactly the same as if the difference had been computed exactly and then rounded. Take another example: 10.1 – 9.93. This becomes

$$\begin{aligned} x &= 1.01 \times 10^1 \\ y &= 0.99 \times 10^1 \\ x - y &= .02 \times 10^1 \end{aligned}$$

The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit! How bad can the error be?

#### Theorem 1

Using a floating-point format with parameters $\beta$ and $p$, and computing differences using $p$ digits, the relative error of the result can be as large as $\beta - 1$.

#### Proof

A relative error of $\beta - 1$ in the expression $x - y$ occurs when $x = 1.00 \ldots 0$ and $y = .\rho\rho \ldots \rho$, where $\rho = \beta - 1$. Here $y$ has $p$ digits (all equal to $\rho$). The exact difference is $x - y = \beta^{-p}$. However, when computing the answer using only $p$ digits, the rightmost digit of $y$ gets shifted off, and so the computed difference is $\beta^{-p+1}$. Thus the error is $\beta^{-p+1} - \beta^{-p} = \beta^{-p}(\beta - 1)$, and the relative error is $\beta^{-p}(\beta - 1)/\beta^{-p} = \beta - 1$. ∎

When $\beta = 2$, the relative error can be as large as the result, and when $\beta = 10$, it can be 9 times larger. Or to put it another way, when $\beta = 2$, equation (3) shows that the number of contaminated digits is $\log_2(1/\epsilon) = \log_2(2^p) = p$. That is, all of the $p$ digits in the result are wrong! Suppose that one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to $p + 1$ digits, and then the result of the subtraction is rounded to $p$ digits. With a guard digit, the previous example becomes

$$\begin{aligned} x &= 1.010 \times 10^1 \\ y &= 0.993 \times 10^1 \\ x - y &= .017 \times 10^1 \end{aligned}$$

and the answer is exact. With a single guard digit, the relative error of the result may be greater than $\epsilon$, as in 110 - 8.59.

$$\begin{aligned} x &= 1.10 \times 10^2 \\ y &= .085 \times 10^2 \\ x - y &= 1.015 \times 10^2 \end{aligned}$$

This rounds to 102, compared with the correct answer of 101.41, for a relative error of .006, which is greater than $\epsilon = .005$. In general, the relative error of the result can be only slightly larger than $\epsilon$. More precisely,

#### Theorem 2

If $x$ and $y$ are floating-point numbers in a format with parameters $\beta$ and $p$, and if subtraction is done with $p + 1$ digits (i.e. one guard digit), then the relative rounding error in the result is less than $2\epsilon$.

This theorem will be proven in Rounding Error. Addition is included in the above theorem since x and y can be positive or negative.
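The subtraction examples of this section can be replayed with a small simulation. This is only a sketch of the shift-truncate-subtract-round model described above, assuming decimal operands with $x > y > 0$ and round to even for the final rounding; the function `subtract` and its `guard_digits` parameter are hypothetical names, not part of any hardware specification.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

def subtract(x, y, p=3, guard_digits=0):
    """Simulate p-digit subtraction x - y (assumes x > y > 0).

    The smaller operand is aligned to x's exponent and truncated to
    p + guard_digits digits; the difference is then rounded to p digits.
    """
    x, y = Decimal(x), Decimal(y)
    # Place value of the last digit kept after aligning to x's exponent.
    quantum = Decimal((0, (1,), x.adjusted() - (p - 1) - guard_digits))
    y_shifted = y.quantize(quantum, rounding=ROUND_DOWN)   # discard shifted-off digits
    return Context(prec=p, rounding=ROUND_HALF_EVEN).plus(x - y_shifted)

print(subtract("10.1", "9.93"))                   # no guard digit
print(subtract("10.1", "9.93", guard_digits=1))   # one guard digit
print(subtract("110", "8.59", guard_digits=1))    # guard digit, yet not exactly rounded
```

The third call shows the case discussed before Theorem 2: one guard digit keeps the relative error small without guaranteeing the exactly rounded answer.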

### Cancellation

The last section can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large. In other words, the evaluation of any expression containing a subtraction (or an addition of quantities with opposite signs) could result in a relative error so large that all the digits are meaningless (Theorem 1). When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. For example, in the quadratic formula, the expression $b^2 - 4ac$ occurs. The quantities $b^2$ and $4ac$ are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps. For example, consider $b = 3.34$, $a = 1.22$, and $c = 2.28$. The exact value of $b^2 - 4ac$ is .0292. But $b^2$ rounds to 11.2 and $4ac$ rounds to 11.1, hence the final answer is .1, which is an error of 70 ulps, even though 11.2 - 11.1 is exactly equal to .1.⁶ The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications.
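The numbers in this example can be reproduced with Python's `decimal` module, which here stands in for a $\beta = 10$, $p = 3$ machine with round to even. The variable names are illustrative.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)   # three significant decimal digits
a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

b_squared = ctx.multiply(b, b)                            # 11.1556 rounds to 11.2
four_ac = ctx.multiply(ctx.multiply(Decimal(4), a), c)    # 11.1264 rounds to 11.1
computed = ctx.subtract(b_squared, four_ac)               # this subtraction is exact

exact = b * b - 4 * a * c                                 # default 28-digit context
error_ulps = abs(computed - exact) / Decimal("0.001")     # ulp of .100 is .001
print(computed, exact, error_ulps)
```

The subtraction itself commits no error; all of the discrepancy comes from the two rounded products.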

Benign cancellation occurs when subtracting exactly known quantities. If $x$ and $y$ have no rounding error, then by Theorem 2 if the subtraction is done with a guard digit, the difference $x - y$ has a very small relative error (less than $2\epsilon$).

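A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula

$$r_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \quad r_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}. \tag{4}$$

When $b^2 \gg ac$, then $b^2 - 4ac$ does not involve a cancellation and

$$\sqrt{b^2 - 4ac} \approx |b|.$$

But the other addition (subtraction) in one of the formulas will have a catastrophic cancellation. To avoid this, multiply the numerator and denominator of $r_1$ by

$$-b - \sqrt{b^2 - 4ac}$$

(and similarly for $r_2$) to obtain

$$r_1 = \frac{2c}{-b - \sqrt{b^2 - 4ac}}, \quad r_2 = \frac{2c}{-b + \sqrt{b^2 - 4ac}}. \tag{5}$$

If $b^2 \gg ac$ and $b > 0$, then computing $r_1$ using formula (4) will involve a cancellation. Therefore, use formula (5) for computing $r_1$ and (4) for $r_2$. On the other hand, if $b < 0$, use (4) for computing $r_1$ and (5) for $r_2$.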
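The selection rule between formulas (4) and (5) can be sketched as follows. This is an illustration in ordinary double precision, assuming real roots, $a \ne 0$, and that $b$ and the discriminant are not both zero; the function names `roots_naive` and `roots_stable` are hypothetical.

```python
import math

def roots_naive(a, b, c):
    """Formula (4) for both roots."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    """Pick (4) or (5) per root so that -b and the square root never cancel."""
    d = math.sqrt(b * b - 4 * a * c)
    if b > 0:
        # -b - d adds two negative numbers: use (5) for r1 and (4) for r2.
        return (2 * c) / (-b - d), (-b - d) / (2 * a)
    # -b + d adds two nonnegative numbers: use (4) for r1 and (5) for r2.
    return (-b + d) / (2 * a), (2 * c) / (-b + d)

# With b^2 >> ac the small root is close to -c/b = -1e-8.
naive_r1, _ = roots_naive(1.0, 1e8, 1.0)
stable_r1, stable_r2 = roots_stable(1.0, 1e8, 1.0)
print(naive_r1, stable_r1, stable_r2)
```

The square root is computed the same way in both functions; only the final addition is rearranged, which is why this removes the cancellation rather than merely making it benign.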

The expression $x^2 - y^2$ is another formula that exhibits catastrophic cancellation. It is more accurate to evaluate it as $(x - y)(x + y)$.⁷ Unlike the quadratic formula, this improved form still has a subtraction, but it is a benign cancellation of quantities without rounding error, not a catastrophic one. By Theorem 2, the relative error in $x - y$ is at most $2\epsilon$. The same is true of $x + y$. Multiplying two quantities with a small relative error results in a product with a small relative error (see the section Rounding Error).

In order to avoid confusion between exact and computed values, the following notation is used. Whereas $x - y$ denotes the exact difference of $x$ and $y$, $x \ominus y$ denotes the computed difference (i.e., with rounding error). Similarly $\oplus$, $\otimes$, and $\oslash$ denote computed addition, multiplication, and division, respectively. All caps indicate the computed value of a function, as in `LN(x)` or `SQRT(x)`. Lowercase functions and traditional mathematical notation denote their exact values, as in $\ln(x)$ and $\sqrt{x}$.

Although $(x \ominus y) \otimes (x \oplus y)$ is an excellent approximation to $x^2 - y^2$, the floating-point numbers $x$ and $y$ might themselves be approximations to some true quantities $\hat{x}$ and $\hat{y}$. For example, $\hat{x}$ and $\hat{y}$ might be exactly known decimal numbers that cannot be expressed exactly in binary. In this case, even though $x \ominus y$ is a good approximation to $x - y$, it can have a huge relative error compared to the true expression $\hat{x} - \hat{y}$, and so the advantage of $(x + y)(x - y)$ over $x^2 - y^2$ is not as dramatic. Since computing $(x + y)(x - y)$ is about the same amount of work as computing $x^2 - y^2$, it is clearly the preferred form in this case. In general, however, replacing a catastrophic cancellation by a benign one is not worthwhile if the expense is large, because the input is often (but not always) an approximation. But eliminating a cancellation entirely (as in the quadratic formula) is worthwhile even if the data are not exact. Throughout this paper, it will be assumed that the floating-point inputs to an algorithm are exact and that the results are computed as accurately as possible.

The expression $x^2 - y^2$ is more accurate when rewritten as $(x - y)(x + y)$ because a catastrophic cancellation is replaced with a benign one. We next present more interesting examples of formulas exhibiting catastrophic cancellation that can be rewritten to exhibit only benign cancellation.

The area of a triangle can be expressed directly in terms of the lengths of its sides $a$, $b$, and $c$ as

$$A = \sqrt{s(s - a)(s - b)(s - c)}, \quad \text{where } s = (a + b + c)/2. \tag{6}$$

Suppose the triangle is very flat; that is, $a \approx b + c$. Then $s \approx a$, and the term $(s - a)$ in formula (6) subtracts two nearby numbers, one of which may have rounding error. For example, if $a = 9.0$, $b = c = 4.53$, the correct value of $s$ is 9.03 and $A$ is 2.342…. Even though the computed value of $s$ (9.05) is in error by only 2 ulps, the computed value of $A$ is 3.04, an error of 70 ulps.

There is a way to rewrite formula (6) so that it will return accurate results even for flat triangles [Kahan 1986]. It is

$$A = \frac{\sqrt{(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))}}{4}, \quad a \ge b \ge c. \tag{7}$$

If $a$, $b$, and $c$ do not satisfy $a \ge b \ge c$, rename them before applying (7). It is straightforward to check that the right-hand sides of (6) and (7) are algebraically identical. Using the values of $a$, $b$, and $c$ above gives a computed area of 2.35, which is 1 ulp in error and much more accurate than the first formula.
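Both computed areas can be reproduced with three-digit decimal arithmetic. The sketch below assumes round to even, evaluates $s$ as $(a \oplus (b \oplus c)) \oslash 2$, and multiplies the factors left to right; those evaluation orders are assumptions, since the text does not fix them. The function names are hypothetical.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

CTX = Context(prec=3, rounding=ROUND_HALF_EVEN)   # beta = 10, p = 3
add, sub, mul, div = CTX.add, CTX.subtract, CTX.multiply, CTX.divide

def area_heron(a, b, c):
    """Formula (6), rounding to three digits after every operation."""
    s = div(add(a, add(b, c)), Decimal(2))
    product = mul(mul(mul(s, sub(s, a)), sub(s, b)), sub(s, c))
    return CTX.sqrt(product)

def area_kahan(a, b, c):
    """Formula (7); the parentheses are essential and must not be rearranged."""
    a, b, c = sorted((a, b, c), reverse=True)     # enforce a >= b >= c
    product = mul(mul(mul(add(a, add(b, c)), sub(c, sub(a, b))),
                      add(c, sub(a, b))), add(a, sub(b, c)))
    return div(CTX.sqrt(product), Decimal(4))

sides = (Decimal("9.0"), Decimal("4.53"), Decimal("4.53"))
print(area_heron(*sides), area_kahan(*sides))
```

In formula (7) every subtraction acts on the original, exactly known side lengths, so each cancellation is benign.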

Although formula (7) is much more accurate than (6) for this example, it would be nice to know how well (7) performs in general.

#### Theorem 3

The rounding error incurred when using (7) to compute the area of a triangle is at most $11\epsilon$, provided that subtraction is performed with a guard digit, $\epsilon \le .005$, and that square roots are computed to within 1/2 ulp.

The condition that $\epsilon < .005$ is met in virtually every actual floating-point system. For example, when $\beta = 2$, $p \ge 8$ ensures that $\epsilon < .005$, and when $\beta = 10$, $p \ge 3$ is enough.

In statements like Theorem 3 that discuss the relative error of an expression, it is understood that the expression is computed using floating-point arithmetic. In particular, the relative error is actually of the expression

(8) `SQRT`$((a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c))) \oslash 4$

Because of the cumbersome nature of (8), in the statement of theorems we will usually say the computed value of $E$ rather than writing out $E$ with circle notation.

Error bounds are usually too pessimistic. In the numerical example given above, the computed value of (7) is 2.35, compared with a true value of 2.34216 for a relative error of $0.7\epsilon$, which is much less than $11\epsilon$. The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems.

A final example of an expression that can be rewritten to use benign cancellation is $(1 + x)^n$, where $x \ll 1$. This expression arises in financial calculations. Consider depositing \$100 every day into a bank account that earns an annual interest rate of 6%, compounded daily. If $n = 365$ and $i = .06$, the amount of money accumulated at the end of one year is

$$100\,\frac{(1 + i/n)^n - 1}{i/n}$$

dollars. If this is computed using $\beta = 2$ and $p = 24$, the result is \$37615.45 compared to the exact answer of \$37614.05, a discrepancy of \$1.40. The reason for the problem is easy to see. The expression $1 + i/n$ involves adding 1 to .0001643836, so the low order bits of $i/n$ are lost. This rounding error is amplified when $1 + i/n$ is raised to the $n$th power.

The troublesome expression $(1 + i/n)^n$ can be rewritten as $e^{n \ln(1 + i/n)}$, where now the problem is to compute $\ln(1 + x)$ for small $x$. One approach is to use the approximation $\ln(1 + x) \approx x$, in which case the payment becomes \$37617.26, which is off by \$3.21 and even less accurate than the obvious formula. But there is a way to compute $\ln(1 + x)$ very accurately, as Theorem 4 shows [Hewlett-Packard 1982]. This formula yields \$37614.07, accurate to within two cents!

Theorem 4 assumes that `LN(x)` approximates $\ln(x)$ to within 1/2 ulp. The problem it solves is that when $x$ is small, `LN`$(1 \oplus x)$ is not close to $\ln(1 + x)$ because $1 \oplus x$ has lost the information in the low order bits of $x$. That is, the computed value of $\ln(1 + x)$ is not close to its actual value when $x \ll 1$.

#### Theorem 4

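If $\ln(1 + x)$ is computed using the formula

$$\ln(1 + x) = \begin{cases} x & \text{for } 1 \oplus x = 1 \\ \dfrac{x \ln(1 + x)}{(1 + x) - 1} & \text{for } 1 \oplus x \ne 1 \end{cases}$$

the relative error is at most $5\epsilon$ when $0 \le x < 3/4$, provided subtraction is performed with a guard digit, $\epsilon < 0.1$, and ln is computed to within 1/2 ulp.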

This formula will work for any value of $x$ but is only interesting for $x \ll 1$, which is where catastrophic cancellation occurs in the naive formula $\ln(1 + x)$. Although the formula may seem mysterious, there is a simple explanation for why it works. Write $\ln(1 + x)$ as

$$x\left(\frac{\ln(1 + x)}{x}\right) = x\mu(x).$$

The left hand factor can be computed exactly, but the right hand factor $\mu(x) = \ln(1 + x)/x$ will suffer a large rounding error when adding 1 to $x$. However, $\mu$ is almost constant, since $\ln(1 + x) \approx x$. So changing $x$ slightly will not introduce much error. In other words, if $\tilde{x} \approx x$, computing $x\mu(\tilde{x})$ will be a good approximation to $x\mu(x) = \ln(1 + x)$. Is there a value for $\tilde{x}$ for which $\tilde{x}$ and $\tilde{x} + 1$ can be computed accurately? There is; namely $\tilde{x} = (1 \oplus x) \ominus 1$, because then $1 + \tilde{x}$ is exactly equal to $1 \oplus x$.
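The formula of Theorem 4 translates almost line for line into code. The sketch below uses Python doubles and `math.log` in place of `LN`; it assumes `math.log` is accurate to roughly an ulp (the theorem asks for 1/2 ulp, which the C library does not promise), and uses the standard-library `math.log1p` only as a reference value. The name `log1p_theorem4` is hypothetical.

```python
import math

def log1p_theorem4(x):
    """ln(1 + x) via x * ln(w) / (w - 1), where w is the computed 1 + x."""
    w = 1.0 + x                  # rounded sum: low order bits of x may be lost
    if w == 1.0:
        return x                 # x is so small that ln(1 + x) rounds to x
    # w - 1 is the exactly representable x-tilde, so ln(w) / (w - 1) is mu(x-tilde).
    return x * math.log(w) / (w - 1.0)

x = 1e-10
reference = math.log1p(x)
naive_error = abs(math.log(1.0 + x) - reference) / reference
theorem4_error = abs(log1p_theorem4(x) - reference) / reference
print(naive_error, theorem4_error)
```

Note that `w - 1.0` must really be computed from the rounded `w`; an optimizer or a symbolic simplification that replaces it with `x` would defeat the method.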

The results of this section can be summarized by saying that a guard digit guarantees accuracy when nearby precisely known quantities are subtracted (benign cancellation). Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit. The price of a guard digit is not high, because it merely requires making the adder one bit wider. For a 54 bit double precision adder, the additional cost is less than 2%. For this price, you gain the ability to run many algorithms such as formula (6) for computing the area of a triangle and the expression ln(1 + x). Although most modern computers have a guard digit, there are a few (such as Cray systems) that do not.

### Exactly Rounded Operations

When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly then rounded to the nearest floating-point number. Operations performed in this manner will be called exactly rounded.⁸ The example immediately preceding Theorem 2 shows that a single guard digit will not always give exactly rounded results. The previous section gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding.

So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought divides the 10 digits in half, letting {0, 1, 2, 3, 4} round down, and {5, 6, 7, 8, 9} round up; thus 12.5 would round to 13. This is how rounding works on Digital Equipment Corporation’s VAX computers. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half. One way of obtaining this 50% behavior is to require that the rounded result have its least significant digit be even. Thus 12.5 rounds to 12 rather than 13 because 2 is even. Which of these methods is best, round up or round to even? Reiser and Knuth [1975] offer the following reason for preferring round to even.

#### Theorem 5

Let $x$ and $y$ be floating-point numbers, and define $x_0 = x$, $x_1 = (x_0 \ominus y) \oplus y$, …, $x_n = (x_{n-1} \ominus y) \oplus y$. If $\oplus$ and $\ominus$ are exactly rounded using round to even, then either $x_n = x$ for all $n$ or $x_n = x_1$ for all $n \ge 1$. ∎

To clarify this result, consider $\beta = 10$, $p = 3$ and let $x = 1.00$, $y = -.555$. When rounding up, the sequence becomes

$$x_0 \ominus y = 1.56, \quad x_1 = 1.56 \ominus .555 = 1.01, \quad x_1 \ominus y = 1.01 \oplus .555 = 1.57,$$

and each successive value of $x_n$ increases by .01, until $x_n = 9.45$ ($n \le 845$)⁹. Under round to even, $x_n$ is always 1.00. This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.
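The drift is easy to observe with three-digit decimal arithmetic. In this sketch, `ROUND_HALF_UP` from Python's `decimal` module plays the role of the round up rule (halfway cases go up) and `ROUND_HALF_EVEN` the role of round to even; the function name `iterate` is hypothetical.

```python
from decimal import Decimal, Context, ROUND_HALF_UP, ROUND_HALF_EVEN

def iterate(x, y, rounding, steps, p=3):
    """Apply x <- (x (-) y) (+) y repeatedly, rounding each operation to p digits."""
    ctx = Context(prec=p, rounding=rounding)
    for _ in range(steps):
        x = ctx.add(ctx.subtract(x, y), y)
    return x

x, y = Decimal("1.00"), Decimal("-0.555")
print(iterate(x, y, ROUND_HALF_UP, 10))     # climbs by .01 on every step
print(iterate(x, y, ROUND_HALF_EVEN, 10))   # stays where it started
```

Every intermediate result in this example lands exactly on a halfway case, which is why the tie-breaking rule alone decides whether the sequence drifts.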

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language. The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic.

The key to multiplication in this system is representing a product $xy$ as a sum, where each summand has the same precision as $x$ and $y$. This can be done by splitting $x$ and $y$. Writing $x = x_h + x_l$ and $y = y_h + y_l$, the exact product is

$$xy = x_h y_h + x_h y_l + x_l y_h + x_l y_l.$$

If $x$ and $y$ have $p$ bit significands, the summands will also have $p$ bit significands provided that $x_l$, $x_h$, $y_h$, $y_l$ can be represented using $\lfloor p/2 \rfloor$ bits. When $p$ is even, it is easy to find a splitting. The number $x_0.x_1 \ldots x_{p-1}$ can be written as the sum of $x_0.x_1 \ldots x_{p/2-1}$ and $0.0 \ldots 0x_{p/2} \ldots x_{p-1}$. When $p$ is odd, this simple splitting method will not work. An extra bit can, however, be gained by using negative numbers. For example, if $\beta = 2$, $p = 5$, and $x = .10111$, $x$ can be split as $x_h = .11$ and $x_l = -.00001$. There is more than one way to split a number. A splitting method that is easy to compute is due to Dekker [1971], but it requires more than a single guard digit.

#### Theorem 6

Let $p$ be the floating-point precision, with the restriction that $p$ is even when $\beta > 2$, and assume that floating-point operations are exactly rounded. Then if $k = \lceil p/2 \rceil$ is half the precision (rounded up) and $m = \beta^k + 1$, $x$ can be split as $x = x_h + x_l$, where

$$x_h = (m \otimes x) \ominus (m \otimes x \ominus x), \quad x_l = x \ominus x_h,$$

and each $x_i$ is representable using $\lfloor p/2 \rfloor$ bits of precision.

To see how this theorem works in an example, let $\beta = 10$, $p = 4$, $b = 3.476$, $a = 3.463$, and $c = 3.479$. Then $b^2 - ac$ rounded to the nearest floating-point number is .03480, while $b \otimes b = 12.08$, $a \otimes c = 12.05$, and so the computed value of $b^2 - ac$ is .03. This is an error of 480 ulps. Using Theorem 6 to write $b = 3.5 - .024$, $a = 3.5 - .037$, and $c = 3.5 - .021$, $b^2$ becomes $3.5^2 - 2 \times 3.5 \times .024 + .024^2$. Each summand is exact, so $b^2 = 12.25 - .168 + .000576$, where the sum is left unevaluated at this point. Similarly, $ac = 3.5^2 - (3.5 \times .037 + 3.5 \times .021) + .037 \times .021 = 12.25 - .2030 + .000777$. Finally, subtracting these two series term by term gives an estimate for $b^2 - ac$ of $0 \oplus .0350 \ominus .000201 = .03480$, which is identical to the exactly rounded result. To show that Theorem 6 really requires exact rounding, consider $p = 3$, $\beta = 2$, and $x = 7$. Then $m = 5$, $mx = 35$, and $m \otimes x = 32$. If subtraction is performed with a single guard digit, then $(m \otimes x) \ominus x = 28$. Therefore, $x_h = 4$ and $x_l = 3$, hence $x_l$ is not representable with $\lfloor p/2 \rfloor = 1$ bit.
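For binary double precision ($\beta = 2$, $p = 53$) the splitting of Theorem 6 uses $k = 27$ and $m = 2^{27} + 1$. The following sketch assumes Python floats are IEEE doubles with exactly rounded operations and that $m \otimes x$ does not overflow; the names `split` and `fits_in_bits` are hypothetical.

```python
import math
from fractions import Fraction

def split(x, p=53):
    """Split x into x_high + x_low using the formulas of Theorem 6 (beta = 2)."""
    k = (p + 1) // 2                 # ceil(p / 2)
    m = float(2 ** k + 1)
    mx = m * x
    x_high = mx - (mx - x)
    x_low = x - x_high
    return x_high, x_low

def fits_in_bits(v, bits):
    """True if the significand of v needs at most `bits` bits."""
    mantissa, _ = math.frexp(v)      # v = mantissa * 2**e with 0.5 <= |mantissa| < 1
    return (mantissa * 2 ** bits).is_integer()

x_high, x_low = split(math.pi)
print(x_high, x_low, fits_in_bits(x_high, 26), fits_in_bits(x_low, 26))
# Products of the halves are exact in double precision:
print(Fraction(x_high) * Fraction(x_low) == Fraction(x_high * x_low))
```

Because each half fits in 26 bits, any product of two halves fits in 52 bits and is therefore computed without rounding error, which is what the multiple precision scheme relies on.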

As a final example of exact rounding, consider dividing $m$ by 10. The result is a floating-point number that will in general not be equal to $m/10$. When $\beta = 2$, multiplying $m/10$ by 10 will restore $m$, provided exact rounding is being used. Actually, a more general fact (due to Kahan) is true. The proof is ingenious, but readers not interested in such details can skip ahead to section The IEEE Standard.

#### Theorem 7

When $\beta = 2$, if $m$ and $n$ are integers with $|m| < 2^{p-1}$ and $n$ has the special form $n = 2^i + 2^j$, then $(m \oslash n) \otimes n = m$, provided floating-point operations are exactly rounded.

#### Proof

Scaling by a power of two is harmless, since it changes only the exponent, not the significand. If $q = m/n$, then scale $n$ so that $2^{p-1} \le n < 2^p$ and scale $m$ so that $1/2 < q < 1$. Thus, $2^{p-2} < m < 2^p$. Since $m$ has $p$ significant bits, it has at most one bit to the right of the binary point. Changing the sign of $m$ is harmless, so assume that $q > 0$.

If $\bar{q} = m \oslash n$, to prove the theorem requires showing that

$$|n\bar{q} - m| \le \frac{1}{4}. \tag{9}$$

That is because $m$ has at most 1 bit right of the binary point, so $n\bar{q}$ will round to $m$. To deal with the halfway case when $|n\bar{q} - m| = 1/4$, note that since the initial unscaled $m$ had $|m| < 2^{p-1}$, its low-order bit was 0, so the low-order bit of the scaled $m$ is also 0. Thus, halfway cases will round to $m$.

Suppose that $q = .q_1 q_2 \ldots$, and let $\hat{q} = .q_1 q_2 \ldots q_p 1$. To estimate $|n\bar{q} - m|$, first compute

$$|\hat{q} - q| = |N/2^{p+1} - m/n|,$$

where $N$ is an odd integer. Since $n = 2^i + 2^j$ and $2^{p-1} \le n < 2^p$, it must be that $n = 2^{p-1} + 2^k$ for some $k \le p - 2$, and thus

$$|\hat{q} - q| = \left|\frac{nN - 2^{p+1}m}{n2^{p+1}}\right| = \left|\frac{(2^{p-1-k} + 1)N - 2^{p+1-k}m}{n2^{p+1-k}}\right|.$$

The numerator is an integer, and since $N$ is odd, it is in fact an odd integer. Thus,

$$|\hat{q} - q| \ge 1/(n2^{p+1-k}).$$

Assume $q < \hat{q}$ (the case $q > \hat{q}$ is similar).¹⁰ Then $n\bar{q} < m$, and

$$\begin{aligned} |m - n\bar{q}| = m - n\bar{q} = n(q - \bar{q}) &= n(q - (\hat{q} - 2^{-p-1})) \\ &\le n\left(2^{-p-1} - \frac{1}{n2^{p+1-k}}\right) \\ &= (2^{p-1} + 2^k)2^{-p-1} - 2^{-p-1+k} = \frac{1}{4}. \end{aligned}$$

This establishes (9) and proves the theorem.¹¹ ∎

The theorem holds true for any base $\beta$, as long as $2^i + 2^j$ is replaced by $\beta^i + \beta^j$. As $\beta$ gets larger, however, denominators of the form $\beta^i + \beta^j$ are farther and farther apart.
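Theorem 7 can be spot-checked in double precision. This is an empirical check over a small range, not a proof, and it assumes Python floats are IEEE doubles whose division and multiplication are exactly rounded; the name `roundtrip_holds` is hypothetical.

```python
def roundtrip_holds(m, n):
    """True if dividing m by n and multiplying back by n restores m exactly."""
    return (m / n) * n == m

# Denominators of the special form 2^i + 2^j, for example 3, 5, 6, 9, 10, 12.
special = sorted({2 ** i + 2 ** j for i in range(6) for j in range(6)})

failures = [(m, n) for n in special for m in range(1, 10000)
            if not roundtrip_holds(m, n)]
print(special)
print(failures)
```

The check covers $n = 10$, the case that motivated the theorem; denominators outside the special form are not covered by the theorem and are deliberately left out of the check.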

We are now in a position to answer the question, Does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are “correct” in the sense that they have a small relative error. The section Cancellation discussed several algorithms that require guard digits to produce correct results in this sense. If the inputs to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting. The reason is that the benign cancellation $x - y$ can become catastrophic if $x$ and $y$ are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7. These are useful even if every floating-point variable is only an approximation to some actual value.

## The IEEE Standard

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires $\beta = 2$, $p = 24$ for single precision and $p = 53$ for double precision [IEEE 1987]. It also specifies the precise layout of bits in a single and double precision. IEEE 854 allows either $\beta = 2$ or $\beta = 10$ and unlike 754, does not specify how floating-point numbers are encoded into bits [Cody et al. 1984]. It does not require a particular value for $p$, but instead it specifies constraints on the allowable values of $p$ for single and double precision. The term IEEE Standard will be used when discussing properties common to both standards.

This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included. It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use. For full details consult the standards themselves [IEEE 1987; Cody et al. 1984].

### Formats and Operations

#### Base

It is clear why IEEE 854 allows $\beta = 10$. Base ten is how humans exchange and think about numbers. Using $\beta = 10$ is especially appropriate for calculators, where the result of each operation is displayed by the calculator in decimal.

There are several reasons why IEEE 854 requires that if the base is not 10, it must be 2. The section Relative Error and Ulps mentioned one reason: the results of error analyses are much tighter when $\beta$ is 2 because a rounding error of .5 ulp wobbles by a factor of $\beta$ when computed as a relative error, and error analyses are almost always simpler when based on relative error. A related reason has to do with the effective precision for large bases. Consider $\beta = 16$, $p = 1$ compared to $\beta = 2$, $p = 4$. Both systems have 4 bits of significand. Consider the computation of 15/8. When $\beta = 2$, 15 is represented as $1.111 \times 2^3$, and 15/8 as $1.111 \times 2^0$. So 15/8 is exact. However, when $\beta = 16$, 15 is represented as $F \times 16^0$, where F is the hexadecimal digit for 15. But 15/8 is represented as $1 \times 16^0$, which has only one bit correct. In general, base 16 can lose up to 3 bits, so that a precision of $p$ hexadecimal digits can have an effective precision as low as $4p - 3$ rather than $4p$ binary bits.

Since large values of $\beta$ have these problems, why did IBM choose $\beta = 16$ for its System/370? Only IBM knows for sure, but there are two possible reasons. The first is increased exponent range. Single precision on the System/370 has $\beta = 16$, $p = 6$. Hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit. Thus the magnitude of representable numbers ranges from about $16^{-2^6}$ to about $16^{2^6} = 2^{2^8}$. To get a similar exponent range when $\beta = 2$ would require 9 bits of exponent, leaving only 22 bits for the significand. However, it was just pointed out that when $\beta = 16$, the effective precision can be as low as $4p - 3 = 21$ bits. Even worse, when $\beta = 2$ it is possible to gain an extra bit of precision (as explained later in this section), so the $\beta = 2$ machine has 23 bits of precision to compare with a range of 21 to 24 bits for the $\beta = 16$ machine.

Another possible explanation for choosing $\beta = 16$ has to do with shifting. When adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation. In the $\beta = 16$, $p = 1$ system, all the numbers between 1 and 15 have the same exponent, and so no shifting is required when adding any of the $\binom{15}{2} = 105$ possible pairs of distinct numbers from this set. However, in the $\beta = 2$, $p = 4$ system, these numbers have exponents ranging from 0 to 3, and shifting is required for 70 of the 105 pairs.
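The count of 70 out of 105 pairs can be checked by brute force. The following is a minimal sketch in Python; the helper name `binary_exponent` is made up for this illustration and simply returns the exponent an integer would carry in a normalized $\beta = 2$ representation.

```python
from itertools import combinations

def binary_exponent(n: int) -> int:
    """Exponent e such that n = 1.xxx * 2**e for a positive integer n."""
    return n.bit_length() - 1

# All pairs of distinct integers between 1 and 15 (each fits in p = 4 bits).
pairs = list(combinations(range(1, 16), 2))

# In base 2, a shift is needed whenever the two exponents differ.
needs_shift = [(a, b) for a, b in pairs
               if binary_exponent(a) != binary_exponent(b)]

print(len(pairs), len(needs_shift))
```

In base 16 with $p = 1$, every one of these integers is a single hexadecimal digit with exponent 0, so no pair needs a shift.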

In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of $\beta = 2$ makes it the preferable base. Another advantage of using $\beta = 2$ is that there is a way to gain an extra bit of significance.12 Since floating-point numbers are always normalized, the most significant bit of the significand is always 1, and there is no reason to waste a bit of storage representing it. Formats that use this trick are said to have a hidden bit. It was already pointed out in Floating-point Formats that this requires a special convention for 0. The method given there was that an exponent of $e_{\min} - 1$ and a significand of all zeros represents not $1.0 \times 2^{e_{\min} - 1}$, but rather 0.

IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits.

#### Precision

The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In IEEE 754, single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32-bit word, double precision two consecutive 32-bit words. Extended precision is a format that offers at least a little extra precision and exponent range (TABLE D-1).

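TABLE D-1   IEEE 754 Format Parameters

| Parameter | Single | Single-Extended | Double | Double-Extended |
|---|---|---|---|---|
| $p$ | 24 | ≥ 32 | 53 | ≥ 64 |
| $e_{\max}$ | +127 | ≥ 1023 | +1023 | > 16383 |
| $e_{\min}$ | -126 | ≤ -1022 | -1022 | ≤ -16382 |
| Exponent width in bits | 8 | ≥ 11 | 11 | ≥ 15 |
| Format width in bits | 32 | ≥ 43 | 64 | ≥ 79 |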

The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as 80-bit format, even though the table shows it using 79 bits. The reason is that hardware implementations of extended precision normally do not use a hidden bit, and so would use 80 rather than 79 bits.13

The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that “Implementations should support the extended format corresponding to the widest basic format supported, ...”

One motivation for extended precision comes from calculators, which will often display 10 digits, but use 13 digits internally. By displaying only 10 of the 13 digits, the calculator appears to the user as a “black box” that computes exponentials, cosines, etc. to 10 digits of accuracy. For the calculator to compute functions like exp, log and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with. It is not hard to find a simple rational expression that approximates log with an error of 500 units in the last place. Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator.

Extended precision in the IEEE standard serves a similar function. It enables libraries to efficiently compute quantities to within about .5 ulp in single (or double) precision, giving the user of those libraries a simple model, namely that each primitive operation, be it a simple multiply or an invocation of log, returns a value accurate to within about .5 ulp. However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user.

To illustrate extended precision further, consider the problem of converting between IEEE 754 single precision and decimal. Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered. It turns out that 9 decimal digits are enough to recover a single precision binary number (see the section Binary to Decimal Conversion). When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer. Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one. First read in the 9 decimal digits as an integer $N$, ignoring the decimal point. From TABLE D-1, $p \ge 32$, and since $10^9 < 2^{32} \approx 4.3 \times 10^9$, $N$ can be represented exactly in single-extended. Next find the appropriate power $10^P$ necessary to scale $N$. This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point. Compute $10^{|P|}$. If $|P| \le 13$, then this is also represented exactly, because $10^{13} = 2^{13} 5^{13}$, and $5^{13} < 2^{32}$. Finally multiply (or divide if $P < 0$) $N$ and $10^{|P|}$. If this last operation is done exactly, then the closest binary number is recovered. The section Binary to Decimal Conversion shows how to do the last multiply (or divide) exactly. Thus for $|P| \le 13$, the use of the single-extended format enables 9-digit decimal numbers to be converted to the closest binary number (i.e. exactly rounded). If $|P| > 13$, then single-extended is not enough for the above algorithm to always compute the exactly rounded binary equivalent, but Coonen [1984] shows that it is enough to guarantee that the conversion of binary to decimal and back will recover the original binary number.

If double precision is supported, then the algorithm above would be run in double precision rather than single-extended, but to convert double precision to a 17-digit decimal number and back would require the double-extended format.
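The 9-digit round-trip claim for single precision can be spot-checked empirically. The sketch below is only an illustration, not the conversion algorithm described above: it assumes Python's `struct` module for rounding to single precision and relies on Python's correctly rounded decimal formatting and parsing (done in double precision, which plays the role of the wider format here). The helper names are hypothetical.

```python
import random
import struct

def bits_to_single(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def round_trips(x: float, digits: int) -> bool:
    """Print x with `digits` significant decimal digits and read it back."""
    text = f"{x:.{digits - 1}e}"
    return to_single(float(text)) == x

random.seed(0)
samples = []
while len(samples) < 10000:
    bits = random.getrandbits(31)          # positive bit patterns only
    exponent_field = (bits >> 23) & 0xFF
    if 1 <= exponent_field <= 253:         # normalized, away from overflow
        samples.append(bits_to_single(bits))

all_recovered = all(round_trips(x, 9) for x in samples)
print(all_recovered)
```

A random sample is evidence, not a proof; the proof is the subject of the section Binary to Decimal Conversion.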

#### Exponent

Since the exponent can be positive or negative, some method must be chosen to represent its sign. Two common methods of representing signed numbers are sign/magnitude and two’s complement. Sign/magnitude is the system used for the sign of the significand in the IEEE formats: one bit is used to hold the sign, the rest of the bits represent the magnitude of the number. The two’s complement representation is often used in integer arithmetic. In this scheme, a number in the range $[-2^{p-1}, 2^{p-1} - 1]$ is represented by the smallest nonnegative number that is congruent to it modulo $2^p$.

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if $\bar{k}$ is the value of the exponent bits interpreted as an unsigned integer, then the exponent of the floating-point number is $\bar{k} - 127$. This is often called the unbiased exponent to distinguish from the biased exponent $\bar{k}$.
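To make the encoding concrete, here is a small sketch that pulls apart the 32 bits of a single precision number and rebuilds the value from the sign bit, the biased exponent, and the 23 stored fraction bits plus the hidden bit. It uses Python's `struct` module; the function names are invented for this example, and `value_from_fields` handles normalized numbers only.

```python
import struct

def decode_single(x: float):
    """Split the IEEE 754 single precision encoding of x into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                     # 1 bit
    biased_exponent = (bits >> 23) & 0xFF # 8 bits, unsigned
    fraction = bits & 0x7FFFFF            # 23 stored significand bits
    return sign, biased_exponent, fraction

def value_from_fields(sign: int, biased_exponent: int, fraction: int) -> float:
    """Rebuild a normalized value: restore the hidden 1 and remove the bias."""
    significand = 1 + fraction / 2**23    # 24-bit significand, p = 24
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)

fields = decode_single(-6.5)              # -6.5 = -1.625 * 2**2
print(fields, value_from_fields(*fields))
```

For -6.5 the stored exponent field is 129, so the unbiased exponent is 129 - 127 = 2.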

Referring to TABLE D-1, single precision has $e_{\max} = 127$ and $e_{\min} = -126$. The reason for having $|e_{\min}| < e_{\max}$ is so that the reciprocal of the smallest number $(1/2^{e_{\min}})$ will not overflow. Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow. The section Base explained that $e_{\min} - 1$ is used for representing 0, and Special Quantities will introduce a use for $e_{\max} + 1$. In IEEE single precision, this means that the unbiased exponents range between $e_{\min} - 1 = -127$ and $e_{\max} + 1 = 128$, whereas the biased exponents range between 0 and 255, which are exactly the nonnegative numbers that can be represented using 8 bits.
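The asymmetry between $e_{\min}$ and $e_{\max}$ can be observed directly. The sketch below uses Python floats, which are IEEE double precision on common platforms (an assumption of this example), so the relevant values are $e_{\min} = -1022$ and $e_{\max} = 1023$ rather than the single precision values quoted above.

```python
import math
import sys

smallest_normal = sys.float_info.min   # 2**-1022
largest_finite = sys.float_info.max    # just under 2**1024

# Reciprocal of the smallest normalized number: representable, no overflow.
recip_of_smallest = 1.0 / smallest_normal

# Reciprocal of the largest number: falls below the normalized range.
recip_of_largest = 1.0 / largest_finite

print(math.isfinite(recip_of_smallest), recip_of_largest < smallest_normal)
```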

#### Operations

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). The section Guard Digits pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different. That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small. However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. By introducing a second guard digit and a third sticky bit, differences can be computed at only a little more cost than with a single guard digit, but the result is the same as if the difference were computed exactly and then rounded [Goldberg 1990]. Thus the standard can be implemented efficiently.

One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds. Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified. Once an algorithm is proven to be correct for IEEE arithmetic, it will work correctly on any machine supporting the IEEE standard.

Brown [1981] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of sections Cancellation and Exactly Rounded Operations, which require features not present on all hardware. Furthermore, Brown’s axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown’s axioms is usually more difficult than proving them assuming operations are exactly rounded.

There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker [1986] have proposed adding inner product to the list of operations that are precisely specified. They note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong. For example sums are a special case of inner products, and the sum $((2 \times 10^{-30} + 10^{30}) - 10^{30}) - 10^{-30}$ is exactly equal to $10^{-30}$, but on a machine with IEEE arithmetic the computed result will be $-10^{-30}$. It is possible to compute inner products to within 1 ulp with less hardware than it takes to implement a fast multiplier [Kirchner and Kulisch 1987].14 15
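The sum in this example is easy to reproduce. The sketch below evaluates it left to right in Python floats (IEEE double precision on common platforms) and compares it with `math.fsum`, which returns the correctly rounded sum of its inputs. `math.fsum` is used here only as a convenient reference; it is not the inner product hardware that Kulisch and Miranker propose.

```python
import math

terms = [2e-30, 1e30, -1e30, -1e-30]

# Left-to-right evaluation, rounding after every addition.
naive = 0.0
for t in terms:
    naive += t

# Correctly rounded sum of the same four floating-point values.
accurate = math.fsum(terms)

print(naive, accurate)
```

The naive loop loses $2 \times 10^{-30}$ entirely when it is added to $10^{30}$, so only the final $-10^{-30}$ survives.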

All the operations mentioned in the standard are required to be exactly rounded except conversion between decimal and binary. The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For conversion, the best known efficient algorithms produce results that are slightly worse than exactly rounded ones [Coonen 1984].

The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker’s dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500…0ddd or 5.0834999…9ddd. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures. Rational approximation, CORDIC,16 and large tables are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware.
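The table maker’s dilemma for exp(1.626) can be explored with arbitrary precision arithmetic. This sketch assumes Python's `decimal` module, whose `Decimal.exp()` method is documented as correctly rounded at the working precision; the helper name `exp_to_digits` is made up for the example.

```python
from decimal import Decimal, localcontext

def exp_to_digits(x: str, digits: int) -> Decimal:
    """Evaluate exp(x) rounded to the given number of significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return Decimal(x).exp()

# Each extra digit of working precision is one more attempt to decide
# which way the 4-place table entry should round.
approximations = [exp_to_digits("1.626", d) for d in range(5, 13)]
for value in approximations:
    print(value)
```

How many extra digits are needed before the rounding direction is settled cannot be known in advance, which is the point of the dilemma.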

### Special Quantities

On some floating-point hardware every bit pattern represents a valid floating-point number. The IBM System/370 is an example of this. On the other hand, the VAX™ reserves some bit patterns to represent special numbers called reserved operands. This idea goes back to the CDC 6600, which had bit patterns for the special quantities `INDEFINITE` and `INFINITY`.

The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation. Under IBM System/370 FORTRAN, the default action in response to computing the square root of a negative number like -4 results in the printing of an error message. Since every bit pattern represents a valid number, the return value of square root must be some floating-point number. In the case of System/370 FORTRAN, $\sqrt{|-4|} = 2$ is returned. In IEEE arithmetic, a NaN is returned in this situation.

The IEEE standard specifies the following special values (see TABLE D-2): ±0, denormalized numbers, ±∞ and NaNs (there is more than one NaN, as explained in the next section). These special values are all encoded with exponents of either $e_{\max} + 1$ or $e_{\min} - 1$ (it was already pointed out that 0 has an exponent of $e_{\min} - 1$).

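TABLE D-2   IEEE 754 Special Values

| Exponent | Fraction | Represents |
|---|---|---|
| $e = e_{\min} - 1$ | $f = 0$ | ±0 |
| $e = e_{\min} - 1$ | $f \ne 0$ | $0.f \times 2^{e_{\min}}$ |
| $e_{\min} \le e \le e_{\max}$ | (any) | $1.f \times 2^{e}$ |
| $e = e_{\max} + 1$ | $f = 0$ | ±∞ |
| $e = e_{\max} + 1$ | $f \ne 0$ | NaN |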

#### NaNs

Traditionally, the computation of 0/0 or $\sqrt{-1}$ has been treated as an unrecoverable error which causes a computation to halt. However, there are examples where it makes sense for a computation to continue in such a situation. Consider a subroutine that finds the zeros of a function $f$, say `zero(f)`. Traditionally, zero finders require the user to input an interval $[a, b]$ on which the function is defined and over which the zero finder will search. That is, the subroutine is called as `zero(f, a, b)`. A more useful zero finder would not require the user to input this extra information. This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain. However, it is easy to see why most zero finders require a domain. The zero finder does its work by probing the function `f` at various values. If it probed for a value outside the domain of `f`, the code for `f` might well compute 0/0 or $\sqrt{-1}$, and the computation would halt, unnecessarily aborting the zero finding process.

This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and $\sqrt{-1}$ produce NaN, rather than halting. A list of some of the situations that can cause a NaN is given in TABLE D-3. Then when `zero(f)` probes outside the domain of `f`, the code for `f` will return NaN, and the zero finder can continue. That is, `zero(f)` is not “punished” for making an incorrect guess. With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be. Suppose that the final statement of `f` is `return (-b + sqrt(d))/(2*a)`. If $d < 0$, then `f` should return a NaN. Since $d < 0$, `sqrt(d)` is a NaN, and `-b + sqrt(d)` will be a NaN, if the sum of a NaN and any other number is a NaN. Similarly if one operand of a division operation is a NaN, the quotient should be a NaN. In general, whenever a NaN participates in a floating-point operation, the result is another NaN.

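TABLE D-3   Operations That Produce a NaN

| Operation | NaN Produced By |
|---|---|
| + | $\infty + (-\infty)$ |
| × | $0 \times \infty$ |
| / | $0/0$, $\infty/\infty$ |
| `REM` | $x$ `REM` $0$, $\infty$ `REM` $y$ |
| $\sqrt{\ }$ | $\sqrt{x}$ (when $x < 0$) |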
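A few rows of TABLE D-3, and the propagation rule for the `(-b + sqrt(d))/(2*a)` example, can be tried in Python. One caveat: Python raises exceptions for `0.0/0.0` and `math.sqrt(-1.0)` instead of returning NaN, so this sketch introduces a hypothetical helper `ieee_sqrt` that mimics the IEEE default result. The function `larger_root` is likewise an illustrative stand-in for the code of `f`.

```python
import math

inf = math.inf

# Rows of TABLE D-3 that Python evaluates with IEEE default results.
nan_sources = {
    "inf + (-inf)": inf + (-inf),
    "0 * inf": 0.0 * inf,
    "inf / inf": inf / inf,
}

def ieee_sqrt(x: float) -> float:
    """Hypothetical helper: return NaN for x < 0 instead of raising."""
    return math.sqrt(x) if x >= 0 else math.nan

def larger_root(a: float, b: float, c: float) -> float:
    """Larger root of a*x**2 + b*x + c; NaN when the roots are complex."""
    d = b * b - 4 * a * c
    return (-b + ieee_sqrt(d)) / (2 * a)   # a NaN from sqrt propagates

print(nan_sources)
print(larger_root(1.0, -3.0, 2.0), larger_root(1.0, 0.0, 1.0))
```

A caller such as a zero finder can test the result with `math.isnan` and move on to the next probe.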

Another approach to writing a zero solver that doesn’t require the user to input a domain is to use signals. The zero-finder could install a signal handler for floating-point exceptions. Then if `f` was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability.

In IEEE 754, NaNs are often represented as floating-point numbers with the exponent emax + 1 and nonzero significands. Implementations are free to put system-dependent information into the significand. Thus there is not a unique NaN, but rather a whole family of NaNs. When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated. Actually, there is a caveat to the last statement. If both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first.

#### Infinity

Just as NaNs provide a way to continue a computation when expressions like 0/0 or $\sqrt{-1}$ are encountered, infinities provide a way to continue when an overflow occurs. This is much safer than simply returning the largest representable number. As an example, consider computing $\sqrt{x^2 + y^2}$, when $\beta = 10$, $p = 3$, and $e_{\max} = 98$. If $x = 3 \times 10^{70}$ and $y = 4 \times 10^{70}$, then $x^2$ will overflow, and be replaced by $9.99 \times 10^{98}$. Similarly $y^2$, and $x^2 + y^2$ will each overflow in turn, and be replaced by $9.99 \times 10^{98}$. So the final result will be $\sqrt{9.99 \times 10^{98}} = 3.16 \times 10^{49}$, which is drastically wrong: the correct answer is $5 \times 10^{70}$. In IEEE arithmetic, the result of $x^2$ is $\infty$, as is $y^2$, $x^2 + y^2$ and $\sqrt{x^2 + y^2}$. So the final result is $\infty$, which is safer than returning an ordinary floating-point number that is nowhere near the correct answer.17
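The same effect shows up in double precision with larger inputs. The sketch below uses $x = 3 \times 10^{200}$ and $y = 4 \times 10^{200}$ (values chosen for this example so that the squares overflow a double), simulates a "saturate to the largest number" policy by clamping, and compares both with `math.hypot`, which avoids the intermediate overflow.

```python
import math
import sys

x, y = 3e200, 4e200

# IEEE default: x*x and y*y overflow to infinity, and so does the result.
with_infinity = math.sqrt(x * x + y * y)

# Simulated saturation: replace the overflowed sum by the largest number.
saturated = math.sqrt(min(x * x + y * y, sys.float_info.max))

# A formulation that does not overflow for these inputs.
scaled = math.hypot(x, y)

print(with_infinity, saturated, scaled)
```

The saturated answer is an ordinary-looking number near $10^{154}$, far from the true value of $5 \times 10^{200}$, whereas the infinite answer at least signals that something went wrong.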

The division of 0 by 0 results in a NaN. A nonzero number divided by 0, however, returns infinity: $1/0 = \infty$, $-1/0 = -\infty$. The reason for the distinction is this: if $f(x) \to 0$ and $g(x) \to 0$ as $x$ approaches some limit, then $f(x)/g(x)$ could have any value. For example, when $f(x) = \sin x$ and $g(x) = x$, then $f(x)/g(x) \to 1$ as $x \to 0$. But when $f(x) = 1 - \cos x$, $f(x)/g(x) \to 0$. When thinking of 0/0 as the limiting situation of a quotient of two very small numbers, 0/0 could represent anything. Thus in the IEEE standard, 0/0 results in a NaN. But when $c > 0$, $f(x) \to c$, and $g(x) \to 0$, then $f(x)/g(x) \to \pm\infty$, for any analytic functions $f$ and $g$. If $g(x) < 0$ for small $x$, then $f(x)/g(x) \to -\infty$, otherwise the limit is $+\infty$. So the IEEE standard defines $c/0 = \pm\infty$, as long as $c \ne 0$. The sign of $\infty$ depends on the signs of $c$ and 0 in the usual way, so that $-10/0 = -\infty$, and $-10/-0 = +\infty$. You can distinguish between getting $\infty$ because of overflow and getting $\infty$ because of division by zero by checking the status flags (which will be discussed in detail in section Flags). The overflow flag will be set in the first case, the division by zero flag in the second.

The rule for determining the result of an operation that has infinity as an operand is simple: replace infinity with a finite number $x$ and take the limit as $x \to \infty$. Thus $3/\infty = 0$, because

$$\lim_{x \to \infty} 3/x = 0.$$

Similarly, $4 - \infty = -\infty$, and $\sqrt{\infty} = \infty$. When the limit doesn’t exist, the result is a NaN, so $\infty/\infty$ will be a NaN (TABLE D-3 has additional examples). This agrees with the reasoning used to conclude that 0/0 should be a NaN.

When a subexpression evaluates to a NaN, the value of the entire expression is also a NaN. In the case of $\pm\infty$ however, the value of the expression might be an ordinary floating-point number because of rules like $1/\infty = 0$. Here is a practical example that makes use of the rules for infinity arithmetic. Consider computing the function $x/(x^2 + 1)$. This is a bad formula, because not only will it overflow when $x$ is larger than $\sqrt{\beta}\,\beta^{e_{\max}/2}$, but infinity arithmetic will give the wrong answer because it will yield 0, rather than a number near $1/x$. However, $x/(x^2 + 1)$ can be rewritten as $1/(x + x^{-1})$. This improved expression will not overflow prematurely and because of infinity arithmetic will have the correct value when $x = 0$: $1/(0 + 0^{-1}) = 1/(0 + \infty) = 1/\infty = 0$. Without infinity arithmetic, the expression $1/(x + x^{-1})$ requires a test for $x = 0$, which not only adds extra instructions, but may also disrupt a pipeline. This example illustrates a general fact, namely that infinity arithmetic often avoids the need for special case checking; however, formulas need to be carefully inspected to make sure they do not have spurious behavior at infinity (as $x/(x^2 + 1)$ did).
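The difference between the two formulas is visible for large arguments in double precision. In the sketch below the function names are illustrative. Note one limitation of the host language: Python raises `ZeroDivisionError` for float division by zero rather than returning $\infty$, so the $x = 0$ case discussed above cannot be shown with plain Python floats and is left out.

```python
import math

def bad_formula(x: float) -> float:
    """x / (x**2 + 1): the denominator overflows for large x."""
    return x / (x * x + 1.0)

def good_formula(x: float) -> float:
    """1 / (x + 1/x): same function, no premature overflow."""
    return 1.0 / (x + 1.0 / x)

big = 1e200                     # big*big overflows to infinity
print(bad_formula(big))         # 1e200 / inf
print(good_formula(big))        # close to 1/big
print(bad_formula(3.0), good_formula(3.0))
```

For moderate arguments the two agree; for `big` the first silently returns 0 while the second returns a value near $1/x$.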

#### Signed Zero

Zero is represented by the exponent $e_{\min} - 1$ and a zero significand. Since the sign bit can take on two different values, there are two zeros, +0 and -0. If a distinction were made when comparing +0 and -0, simple tests like `if (x = 0)` would have very unpredictable behavior, depending on the sign of `x`. Thus the IEEE standard defines comparison so that +0 = -0, rather than -0 < +0. Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer. Thus 3·(+0) = +0, and +0/-3 = -0. If zero did not have a sign, then the relation $1/(1/x) = x$ would fail to hold when $x = \pm\infty$. The reason is that $1/{-\infty}$ and $1/{+\infty}$ both result in 0, and 1/0 results in $+\infty$, the sign information having been lost. One way to restore the identity $1/(1/x) = x$ is to only have one kind of infinity, however that would result in the disastrous consequence of losing the sign of an overflowed quantity.

Another example of the use of signed zero concerns underflow and functions that have a discontinuity at 0, such as log. In IEEE arithmetic, it is natural to define $\log 0 = -\infty$ and $\log x$ to be a NaN when $x < 0$. Suppose that $x$ represents a small negative number that has underflowed to zero. Thanks to signed zero, $x$ will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return $-\infty$. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number.
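The behavior of signed zero described here can be observed in Python floats. This is a small sketch using only the standard `math` module; `math.copysign` is used to read the sign of a zero, since `==` deliberately cannot tell the two zeros apart. The `math.atan2` lines are an extra illustration (not from the text) of a function whose result depends on the sign of a zero argument.

```python
import math

pos_zero, neg_zero = 0.0, -0.0

zeros_compare_equal = (pos_zero == neg_zero)        # comparison ignores sign
sign_of_product = math.copysign(1.0, 3.0 * pos_zero)  # 3 * (+0) = +0
sign_of_quotient = math.copysign(1.0, pos_zero / -3.0)  # +0 / -3 = -0

# 1/(-inf) is -0, so the sign of the infinity is not lost.
sign_of_recip = math.copysign(1.0, 1.0 / -math.inf)

# atan2 distinguishes the two sides of the negative real axis.
above_axis = math.atan2(pos_zero, -1.0)
below_axis = math.atan2(neg_zero, -1.0)

print(zeros_compare_equal, sign_of_product, sign_of_quotient, sign_of_recip)
print(above_axis, below_axis)
```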

Probably the most interesting use of signed zero occurs in complex arithmetic. To take a simple example, consider the equation $\sqrt{1/z} = 1/\sqrt{z}$. This is certainly true when $z \ge 0$. If $z = -1$, the obvious computation gives $\sqrt{1/(-1)} = \sqrt{-1} = i$ and $1/\sqrt{-1} = 1/i = -i$. Thus, $\sqrt{1/z} \ne 1/\sqrt{z}$! The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration. This leaves the problem of what to do for the negative real numbers, which are of the form $-x + i0$, where $x > 0$. Signed zero provides a perfect way to resolve this problem. Numbers of the form $-x + i(+0)$ have one sign $(i\sqrt{x})$ and numbers of the form $-x + i(-0)$ on the other side of the branch cut have the other sign $(-i\sqrt{x})$. In fact, the natural formulas for computing the square root will give these results.

Back to $\sqrt{1/z} = 1/\sqrt{z}$. If $z = -1 = -1 + i0$, then

$$1/z = \frac{1}{-1 + i0} = \frac{-1 - i0}{(-1 + i0)(-1 - i0)} = \frac{-1 - i0}{(-1)^2 + 0^2} = -1 + i(-0),$$

and so $\sqrt{1/z} = \sqrt{-1 + i(-0)} = -i$, while $1/\sqrt{z} = 1/i = -i$. Thus IEEE arithmetic preserves this identity for all $z$. Some more sophisticated examples are given by Kahan [1987]. Although distinguishing between +0 and -0 has advantages, it can occasionally be confusing. For example, signed zero destroys the relation $x = y \Leftrightarrow 1/x = 1/y$, which is false when $x = +0$ and $y = -0$. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages.
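Python's `cmath.sqrt` honors the sign of a zero imaginary part on the branch cut, so the identity can be checked for $z = -1$. In this sketch the value of $1/z$ is written out by hand as $-1 + i(-0)$, following the derivation above, instead of relying on how a particular complex division routine treats zero signs.

```python
import cmath

z = complex(-1.0, 0.0)                  # -1 + i(+0)
one_over_z = complex(-1.0, -0.0)        # -1 + i(-0), from the derivation

sqrt_above_cut = cmath.sqrt(z)          # imaginary part +0 selects  i
sqrt_below_cut = cmath.sqrt(one_over_z) # imaginary part -0 selects -i

lhs = sqrt_below_cut                    # sqrt(1/z)
rhs = 1 / sqrt_above_cut                # 1/sqrt(z)

print(sqrt_above_cut, sqrt_below_cut, lhs == rhs)
```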

#### Denormalized Numbers

Consider normalized floating-point numbers with $\beta = 10$, $p = 3$, and $e_{\min} = -98$. The numbers $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$ appear to be perfectly ordinary floating-point numbers, which are more than a factor of 10 larger than the smallest floating-point number $1.00 \times 10^{-98}$. They have a strange property, however: $x \ominus y = 0$ even though $x \ne y$! The reason is that $x - y = .06 \times 10^{-97} = 6.0 \times 10^{-99}$ is too small to be represented as a normalized number, and so must be flushed to zero. How important is it to preserve the property

(10) $x = y \Leftrightarrow x - y = 0$ ?

It’s very easy to imagine writing the code fragment, `if (x ≠ y) then z = 1/(x-y)`, and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren’t going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend. For example, when analyzing formula (6), it was very helpful to know that $x/2 < y < 2x \Rightarrow x \ominus y = x - y$. Similarly, knowing that (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything.

The IEEE standard uses denormalized18 numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard.19 The idea behind denormalized numbers goes back to Goldberg [1967] and is very simple. When the exponent is $e_{\min}$, the significand does not have to be normalized, so that when $\beta = 10$, $p = 3$ and $e_{\min} = -98$, $1.00 \times 10^{-98}$ is no longer the smallest floating-point number, because $0.98 \times 10^{-98}$ is also a floating-point number.

There is a small snag when $\beta = 2$ and a hidden bit is being used, since a number with an exponent of $e_{\min}$ will always have a significand greater than or equal to 1.0 because of the implicit leading bit. The solution is similar to that used to represent 0, and is summarized in TABLE D-2. The exponent field value $e_{\min} - 1$ is used to mark denormals. More formally, if the bits in the significand field are $b_1, b_2, \ldots, b_{p-1}$, and the value of the exponent is $e$, then when $e > e_{\min} - 1$, the number being represented is $1.b_1 b_2 \ldots b_{p-1} \times 2^{e}$ whereas when $e = e_{\min} - 1$, the number being represented is $0.b_1 b_2 \ldots b_{p-1} \times 2^{e+1}$. The +1 in the exponent is needed because denormals have an exponent of $e_{\min}$, not $e_{\min} - 1$.

Recall the example of $\beta = 10$, $p = 3$, $e_{\min} = -98$, $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$ presented at the beginning of this section. With denormals, $x - y$ does not flush to zero but is instead represented by the denormalized number $.6 \times 10^{-98}$. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow.
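A double precision analogue of this example is sketched below. The values of `x` and `y` are chosen for this illustration: both are normalized doubles slightly above the smallest normalized number, and their difference falls below it. The sketch assumes a platform that implements gradual underflow for Python floats, which is the usual case.

```python
import sys

smallest_normal = sys.float_info.min   # about 2.2e-308 for IEEE double

x = 3.0e-308                           # normalized
y = 2.5e-308                           # normalized, and x != y

difference = x - y                     # about 5e-309, below the normal range

# With gradual underflow the difference is a nonzero denormal, so the
# guard `if x != y` really does protect 1/(x - y) from dividing by zero.
is_denormal = 0.0 < difference < smallest_normal

print(difference, is_denormal)
```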

FIGURE D-2 Flush To Zero Compared With Gradual Underflow

FIGURE D-2 illustrates denormalized numbers. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number $1.0 \times \beta^{e_{\min}}$. If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The “gulf” is filled in, and when the result of a calculation is less than $1.0 \times \beta^{e_{\min}}$, it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of $\beta$. Without denormals, the spacing abruptly changes from $\beta^{-p+1}\beta^{e_{\min}}$ to $\beta^{e_{\min}}$, which is a factor of $\beta^{p-1}$, rather than the orderly change by a factor of $\beta$. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.

Without gradual underflow, the simple expression $x - y$ can have a very large relative error for normalized inputs, as was seen above for $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$. Large relative errors can happen even without cancellation, as the following example shows [Demmel 1984]. Consider dividing two complex numbers, $a + ib$ and $c + id$. The obvious formula

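$$\frac{a + ib}{c + id} = \frac{ac + bd}{c^2 + d^2} + i\,\frac{bc - ad}{c^2 + d^2}$$

suffers from the problem that if either component of the denominator $c + id$ is larger than $\sqrt{\beta}\,\beta^{e_{\max}/2}$, the formula will overflow, even though the final result may be well within range. A better method of computing the quotients is to use Smith’s formula:

(11) $\dfrac{a + ib}{c + id} = \begin{cases} \dfrac{a + b(d/c)}{c + d(d/c)} + i\,\dfrac{b - a(d/c)}{c + d(d/c)} & \text{if } |d| < |c| \\[2ex] \dfrac{b + a(c/d)}{d + c(c/d)} + i\,\dfrac{-a + b(c/d)}{d + c(c/d)} & \text{if } |d| \ge |c| \end{cases}$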

Applying Smith’s formula to (2 · 10-98 + i10-98)/(4 · 10-98 + i(2 · 10-98)) gives the correct answer of 0.5 with gradual underflow. It yields 0.4 with flush to zero, an error of 100 ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to 1.0 x .
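The overflow problem with the obvious formula can be seen in IEEE double precision with a small sketch. The function names `naive_div` and `smith_div` are illustrative, not from the text, and complex numbers are passed as separate real and imaginary parts.

```python
import math

def naive_div(a, b, c, d):
    """(a + ib)/(c + id) by the obvious formula; c*c + d*d may overflow."""
    den = c * c + d * d
    return ((a * c + b * d) / den, (b * c - a * d) / den)

def smith_div(a, b, c, d):
    """(a + ib)/(c + id) by Smith's formula (11)."""
    if abs(d) < abs(c):
        r = d / c
        den = c + d * r
        return ((a + b * r) / den, (b - a * r) / den)
    r = c / d
    den = d + c * r
    return ((b + a * r) / den, (-a + b * r) / den)

# The true quotient is 0.5, but every product in the obvious formula overflows.
naive = naive_div(1e200, 1e200, 2e200, 2e200)
smith = smith_div(1e200, 1e200, 2e200, 2e200)
```

Here the obvious formula produces NaN components (infinity divided by infinity), while Smith's formula stays in range throughout.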

### Exceptions, Flags and Trap Handlers

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. Typical of the default results are NaN for $0/0$ and $\sqrt{-1}$, and $\infty$ for $1/0$ and overflow. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do. When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are “sticky” in that once set, they remain set until explicitly cleared. Testing the flags is the only way to distinguish $1/0$, which is a genuine infinity, from an overflow.

Sometimes continuing execution in the face of exception conditions is not appropriate. The section Infinity gave the example of $x/(x^2 + 1)$. When $x > \sqrt{\beta}\,\beta^{e_{\max}/2}$, the denominator is infinite, resulting in a final answer of 0, which is totally wrong. Although for this formula the problem can be solved by rewriting it as $1/(x + x^{-1})$, rewriting may not always solve the problem. The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.

The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the situations listed in TABLE D-3, and any comparison that involves a NaN. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. When one of the operands to an operation is a NaN, the result is a NaN but no invalid exception is raised unless the operation also satisfies one of the conditions in TABLE D-3.[20]

TABLE D-4   Exceptions in IEEE 754*
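| Exception | Result when traps disabled | Argument to trap handler |
|---|---|---|
| overflow | $\pm\infty$ or $\pm x_{\max}$ | $\mathrm{round}(x\,2^{-\alpha})$ |
| underflow | $0$, $\pm 2^{e_{\min}}$ or denormal | $\mathrm{round}(x\,2^{\alpha})$ |
| divide by zero | $\pm\infty$ | operands |
| invalid | NaN | operands |
| inexact | $\mathrm{round}(x)$ | $\mathrm{round}(x)$ |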

*$x$ is the exact result of the operation, $\alpha = 192$ for single precision, 1536 for double, and $x_{\max} = 1.11\ldots11 \times 2^{e_{\max}}$.

The inexact exception is raised when the result of a floating-point operation is not exact. In the $\beta = 10$, $p = 3$ system, $3.5 \otimes 4.2 = 14.7$ is exact, but $3.5 \otimes 4.3 = 15.0$ is not exact (since $3.5 \cdot 4.3 = 15.05$), and raises an inexact exception. Binary to Decimal Conversion discusses an algorithm that uses the inexact exception. A summary of the behavior of all five exceptions is given in TABLE D-4.
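The $\beta = 10$, $p = 3$ example can be tried with Python's `decimal` module, which offers a similar model of sticky status flags and per-exception trap enables. This is only an analogy: `decimal` is a software decimal arithmetic, not IEEE 754 binary hardware, so the sketch illustrates the concepts rather than any particular floating-point unit.

```python
from decimal import Decimal, DivisionByZero, Inexact, localcontext

with localcontext() as ctx:
    ctx.prec = 3                       # beta = 10, p = 3
    ctx.clear_flags()

    exact = Decimal("3.5") * Decimal("4.2")     # 14.7 fits in three digits
    inexact_before = bool(ctx.flags[Inexact])
    rounded = Decimal("3.5") * Decimal("4.3")   # 15.05 is rounded to 15.0
    inexact_after = bool(ctx.flags[Inexact])

    ctx.traps[DivisionByZero] = False  # trap disabled: default result plus flag
    quotient = Decimal(1) / Decimal(0)
    flag_set = bool(ctx.flags[DivisionByZero])
    Decimal(1) + Decimal(1)            # a later, unexceptional operation
    flag_sticky = bool(ctx.flags[DivisionByZero])

    ctx.traps[DivisionByZero] = True   # trap enabled: a handler (here, a Python
    try:                               # exception) runs instead
        Decimal(1) / Decimal(0)
        trapped = False
    except DivisionByZero:
        trapped = True
```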

There is an implementation issue connected with the fact that the inexact exception is raised so often. If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive. This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled.

#### Trap Handlers

One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. This is especially useful for codes with a loop like `do S until (x >= 100)`. Since comparing a NaN to a number with $<$, $\le$, $>$, $\ge$, or $=$ (but not $\ne$) always returns false, this code will go into an infinite loop if `x` ever becomes a NaN.
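A brief illustration of why such a loop never exits once `x` is a NaN. The helper `run_until` and its iteration cap `limit` are hypothetical, added only so that the demonstration itself terminates.

```python
import math

x = math.nan
# Every ordered comparison and == is false when an operand is NaN; only != is true.
results = {
    "<": x < 100, "<=": x <= 100, ">": x > 100,
    ">=": x >= 100, "==": x == 100, "!=": x != 100,
}

def run_until(x, step, limit=1000):
    """Model of `do S until (x >= 100)`; returns the iteration count, or None
    if the exit test never became true within `limit` iterations."""
    for iterations in range(1, limit + 1):
        x = step(x)
        if x >= 100:
            return iterations
    return None

normal_case = run_until(0.0, lambda v: v + 30.0)      # exits after a few steps
nan_case = run_until(math.nan, lambda v: v + 30.0)    # exit test is never true
```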

There is a more interesting use for trap handlers that comes up when computing products such as $\prod_{i=1}^{n} x_i$ that could potentially overflow. One solution is to use logarithms, and compute $\exp\left(\sum \log x_i\right)$ instead. The problem with this approach is that it is less accurate, and that it costs more than the simple expression $\prod x_i$, even if there is no overflow. There is another solution using trap handlers called over/underflow counting that avoids both of these problems [Sterbenz 1974].

The idea is as follows. There is a global counter initialized to zero. Whenever the partial product $p_k = \prod_{i=1}^{k} x_i$ overflows for some $k$, the trap handler increments the counter by one and returns the overflowed quantity with the exponent wrapped around. In IEEE 754 single precision, $e_{\max} = 127$, so if $p_k = 1.45 \times 2^{130}$, it will overflow and cause the trap handler to be called, which will wrap the exponent back into range, changing $p_k$ to $1.45 \times 2^{-62}$ (see below). Similarly, if $p_k$ underflows, the counter would be decremented, and the negative exponent would get wrapped around into a positive one. When all the multiplications are done, if the counter is zero then the final product is $p_n$. If the counter is positive, the product overflowed; if the counter is negative, it underflowed. If none of the partial products are out of range, the trap handler is never called and the computation incurs no extra cost. Even if there are over/underflows, the calculation is more accurate than if it had been computed with logarithms, because each $p_k$ was computed from $p_{k-1}$ using a full precision multiply. Barnett [1987] discusses a formula where the full accuracy of over/underflow counting turned up an error in earlier tables of that formula.

IEEE 754 specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by $2^{\alpha}$, and then rounded to the relevant precision. For underflow, the result is multiplied by $2^{\alpha}$. The exponent $\alpha$ is 192 for single precision and 1536 for double precision. This is why $1.45 \times 2^{130}$ was transformed into $1.45 \times 2^{-62}$ in the example above.
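Python does not expose floating-point trap handlers, so the following is only a software emulation of over/underflow counting for double precision, assuming finite inputs. It splits each operand with `math.frexp`, checks whether the product's exponent would leave the normalized range, and wraps it by $\alpha = 1536$ as a trap handler would. The name `counted_product` is illustrative.

```python
import math

ALPHA = 1536               # exponent wrap for IEEE double precision
EMAX, EMIN = 1023, -1022   # normalized exponent range of IEEE double

def counted_product(xs):
    """Return (p, counter) such that the true product is p * 2**(ALPHA * counter)."""
    p, counter = 1.0, 0
    for x in xs:
        mp, ep = math.frexp(p)        # p = mp * 2**ep with 0.5 <= |mp| < 1
        mx, ex = math.frexp(x)
        m, e = math.frexp(mp * mx)    # the only rounding: one full-precision multiply
        e += ep + ex                  # product is m * 2**e
        if m != 0.0:
            if e - 1 > EMAX:          # would overflow: wrap down, count up
                counter += 1
                e -= ALPHA
            elif e - 1 < EMIN:        # would underflow: wrap up, count down
                counter -= 1
                e += ALPHA
        p = math.ldexp(m, e)
    return p, counter

big, small = 2.0 ** 1000, 2.0 ** -1000
partial = counted_product([big, big])            # overflows once, exponent wrapped
full = counted_product([big, big, small, small]) # wraps up and back down again
```

The plain product `big * big * small * small` evaluates to infinity after the second factor, whereas the counted version recovers the in-range result with a counter of zero.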

#### Rounding Modes

In the IEEE standard, rounding occurs whenever an operation has a result that is not exact, since (with the exception of binary decimal conversion) each operation is computed exactly and then rounded. By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward $+\infty$, and round toward $-\infty$. When used with the convert to integer operation, round toward $-\infty$ causes the convert to become the floor function, while round toward $+\infty$ is ceiling. The rounding mode affects overflow, because when round toward 0 or round toward $-\infty$ is in effect, an overflow of positive magnitude causes the default result to be the largest representable number, not $+\infty$. Similarly, overflows of negative magnitude will produce the largest negative number when round toward $+\infty$ or round toward 0 is in effect.
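Python's binary floats do not expose the rounding mode, but the `decimal` module provides analogous directed rounding modes, so it can serve as a stand-in to illustrate the behavior described above. The three-digit precision and the tiny exponent limit `Emax = 5` are arbitrary choices made for the demonstration.

```python
from decimal import (Decimal, Overflow, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN, localcontext)

MODES = (ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

def in_mode(rounding, operation):
    """Run `operation` in a 3-digit context with the given rounding mode."""
    with localcontext() as ctx:
        ctx.prec = 3
        ctx.Emax = 5                   # largest finite number is 9.99E+5
        ctx.traps[Overflow] = False    # deliver the default result on overflow
        ctx.rounding = rounding
        return operation()

# -2/3 = -0.6666...: nearest, toward 0, toward +infinity, toward -infinity
quotients = [in_mode(m, lambda: Decimal(-2) / Decimal(3)) for m in MODES]

# Positive overflow: infinity, or the largest representable number
overflows = [in_mode(m, lambda: Decimal("9.99E5") * Decimal(10)) for m in MODES]

# Convert to integer: floor and ceiling
floor_value = Decimal("-2.5").to_integral_value(rounding=ROUND_FLOOR)
ceil_value = Decimal("-2.5").to_integral_value(rounding=ROUND_CEILING)
```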

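One application of rounding modes occurs in interval arithmetic (another is mentioned in Binary to Decimal Conversion). When using interval arithmetic, the sum of two numbers $x$ and $y$ is an interval $[\underline{z}, \overline{z}]$, where $\underline{z}$ is $x \oplus y$ rounded toward $-\infty$, and $\overline{z}$ is $x \oplus y$ rounded toward $+\infty$. The exact result of the addition is contained within the interval $[\underline{z}, \overline{z}]$. Without rounding modes, interval arithmetic is usually implemented by computing $\underline{z} = (x \oplus y)(1 - \epsilon)$ and $\overline{z} = (x \oplus y)(1 + \epsilon)$, where $\epsilon$ is machine epsilon.[21] This results in overestimates for the size of the intervals. Since the result of an operation in interval arithmetic is an interval, in general the input to an operation will also be an interval. If two intervals $[\underline{x}, \overline{x}]$ and $[\underline{y}, \overline{y}]$ are added, the result is $[\underline{z}, \overline{z}]$, where $\underline{z}$ is $\underline{x} \oplus \underline{y}$ with the rounding mode set to round toward $-\infty$, and $\overline{z}$ is $\overline{x} \oplus \overline{y}$ with the rounding mode set to round toward $+\infty$.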
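A minimal sketch of interval addition with directed rounding. Because Python's binary floats offer no access to the rounding mode, the sketch again uses the `decimal` module (three significant digits, an arbitrary choice) as a stand-in; the function name `interval_add` is illustrative.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, localcontext

def interval_add(x, y, prec=3):
    """Add intervals x = (x_lo, x_hi) and y = (y_lo, y_hi) with outward rounding."""
    with localcontext() as ctx:
        ctx.prec = prec
        ctx.rounding = ROUND_FLOOR      # lower bound rounded toward -infinity
        lo = x[0] + y[0]
        ctx.rounding = ROUND_CEILING    # upper bound rounded toward +infinity
        hi = x[1] + y[1]
    return lo, hi

# Two point intervals whose exact sum, 10.326, needs more than three digits.
x = (Decimal("9.87"), Decimal("9.87"))
y = (Decimal("0.456"), Decimal("0.456"))
z = interval_add(x, y)
```

The resulting interval brackets the exact sum, and its width is a single unit in the last place rather than the wider bound obtained by scaling with $1 \pm \epsilon$.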

When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation. This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval. Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p. If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size.

#### Flags

The IEEE standard has a number of flags and modes. As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact. There are four rounding modes: round toward nearest, round toward $+\infty$, round toward 0, and round toward $-\infty$. It is strongly recommended that there be an enable mode bit for each of the five exceptions. This section gives some simple examples of how these modes and flags can be put to good use. A more sophisticated example is discussed in the section Binary to Decimal Conversion.

Consider writing a subroutine to compute $x^n$, where $n$ is an integer. When $n > 0$, a simple routine like the following will do:

```
PositivePower(x,n) {
  while (n is even) {
    x = x*x
    n = n/2
  }
  u = x
  while (true) {
    n = n/2
    if (n==0) return u
    x = x*x
    if (n is odd) u = u*x
  }
}
```

If $n < 0$, then a more accurate way to compute $x^n$ is not to call `PositivePower(1/x, -n)` but rather `1/PositivePower(x, -n)`, because the first expression multiplies $n$ quantities each of which has a rounding error from the division (i.e., $1/x$). In the second expression these are exact (i.e., $x$), and the final division commits just one additional rounding error. Unfortunately, there is a slight snag in this strategy. If `PositivePower(x, -n)` underflows, then either the underflow trap handler will be called, or else the underflow status flag will be set. This is incorrect, because if $x^{-n}$ underflows, then $x^n$ will either overflow or be in range.[22] But since the IEEE standard gives the user access to all the flags, the subroutine can easily correct for this. It simply turns off the overflow and underflow trap enable bits and saves the overflow and underflow status bits. It then computes `1/PositivePower(x, -n)`. If neither the overflow nor underflow status bit is set, it restores them together with the trap enable bits. If one of the status bits is set, it restores the flags and redoes the calculation using `PositivePower(1/x, -n)`, which causes the correct exceptions to occur.
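A runnable sketch of this strategy in Python. Python floats do not expose the IEEE status flags, so the sketch substitutes a cruder test (an infinite or zero intermediate result) for "a status bit is set". That substitution is an assumption of this example, not the flag-based method described above, and the function names are illustrative.

```python
import math

def positive_power(x, n):
    """Compute x**n for an integer n > 0 by repeated squaring."""
    while n % 2 == 0:
        x = x * x
        n //= 2
    u = x
    while True:
        n //= 2
        if n == 0:
            return u
        x = x * x
        if n % 2 == 1:
            u = u * x

def power(x, n):
    """Compute x**n for any integer n, preferring 1/positive_power(x, -n)."""
    if n > 0:
        return positive_power(x, n)
    if n == 0:
        return 1.0
    p = positive_power(x, -n)
    if math.isinf(p) or p == 0.0:
        # Stand-in for "overflow or underflow flag set": redo the calculation
        # so that any exceptional result belongs to x**n itself.
        return positive_power(1.0 / x, -n)
    return 1.0 / p
```

For example, `power(10.0, -3)` performs exact multiplications up to 1000.0 and then a single division.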

Another example of the use of flags occurs when computing arccos via the formula

$$\arccos x = 2 \arctan \sqrt{\frac{1 - x}{1 + x}}.$$

If $\arctan(\infty)$ evaluates to $\pi/2$, then $\arccos(-1)$ will correctly evaluate to $2 \cdot \arctan(\infty) = \pi$, because of infinity arithmetic. However, there is a small snag, because the computation of $(1 - x)/(1 + x)$ will cause the divide by zero exception flag to be set, even though $\arccos(-1)$ is not exceptional. The solution to this problem is straightforward. Simply save the value of the divide by zero flag before computing arccos, and then restore its old value after the computation.

## Systems Aspects

The design of almost every aspect of a computer system requires knowledge about floating-point. Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions. Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers. As an example of how plausible design decisions can lead to unexpected behavior, consider the following BASIC program.

```
q = 3.0/7.0
if q = 3.0/7.0 then print "Equal":
    else print "Not Equal"
```

When compiled and run using Borland’s Turbo Basic on an IBM PC, the program prints `Not Equal`! This example will be analyzed in the next section.

Incidentally, some people think that the solution to such anomalies is never to compare floating-point numbers for equality, but instead to consider them equal if they are within some error bound $E$. This is hardly a cure-all because it raises as many questions as it answers. What should the value of $E$ be? If $x < 0$ and $y > 0$ are within $E$, should they really be considered to be equal, even though they have different signs? Furthermore, the relation defined by this rule, $a \sim b \Leftrightarrow |a - b| < E$, is not an equivalence relation because $a \sim b$ and $b \sim c$ does not imply that $a \sim c$.
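The failure of transitivity is easy to exhibit. In this small sketch the bound `E = 1.0` and the sample values are arbitrary choices, picked to be exactly representable in binary.

```python
def approx_equal(a, b, E=1.0):
    """The relation a ~ b  <=>  |a - b| < E."""
    return abs(a - b) < E

a, b, c = 0.0, 0.75, 1.5
a_b = approx_equal(a, b)     # a ~ b holds
b_c = approx_equal(b, c)     # b ~ c holds
a_c = approx_equal(a, c)     # but a ~ c does not

opposite_signs = approx_equal(-0.25, 0.25)   # "equal" despite different signs
```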

### Instruction Sets

It is quite common for an algorithm to require a short burst of higher precision in order to produce accurate results. One example occurs in the quadratic formula $(-b \pm \sqrt{b^2 - 4ac})/2a$. As discussed in the section Proof of Theorem 4, when $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. By performing the subcalculation of $b^2 - 4ac$ in double precision, half the double precision bits of the root are lost, which means that all the single precision bits are preserved.

The computation of $b^2 - 4ac$ in double precision when each of the quantities $a$, $b$, and $c$ is in single precision is easy if there is a multiplication instruction that takes two single precision numbers and produces a double precision result. In order to produce the exactly rounded product of two p-digit numbers, a multiplier needs to generate the entire 2p bits of product, although it may throw bits away as it proceeds. Thus, hardware to compute a double precision product from single precision operands will normally be only a little more expensive than a single precision multiplier, and much cheaper than a double precision multiplier. Despite this, modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands.[23]

If an instruction that combines two single precision operands to produce a double precision product was only useful for the quadratic formula, it wouldn’t be worth adding to an instruction set. However, this instruction has many other uses. Consider the problem of solving a system of linear equations,

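$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n
\end{aligned}$$

which can be written in matrix form as $Ax = b$, where

$$A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}$$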

Suppose that a solution $x^{(1)}$ is computed by some method, perhaps Gaussian elimination. There is a simple way to improve the accuracy of the result called iterative improvement. First compute

$$\xi = Ax^{(1)} - b \tag{12}$$

and then solve the system

$$Ay = \xi \tag{13}$$

Note that if $x^{(1)}$ is an exact solution, then $\xi$ is the zero vector, as is $y$. In general, the computation of $\xi$ and $y$ will incur rounding error, so $Ay \approx \xi \approx Ax^{(1)} - b = A(x^{(1)} - x)$, where $x$ is the (unknown) true solution. Then $y \approx x^{(1)} - x$, so an improved estimate for the solution is

$$x^{(2)} = x^{(1)} - y \tag{14}$$

The three steps (12), (13), and (14) can be repeated, replacing $x^{(1)}$ with $x^{(2)}$, and $x^{(2)}$ with $x^{(3)}$. This argument that $x^{(i+1)}$ is more accurate than $x^{(i)}$ is only informal. For more information, see [Golub and Van Loan 1989].

When performing iterative improvement, $\xi$ is a vector whose elements are the difference of nearby inexact floating-point numbers, and so can suffer from catastrophic cancellation. Thus iterative improvement is not very useful unless $\xi = Ax^{(1)} - b$ is computed in double precision. Once again, this is a case of computing the product of two single precision numbers ($A$ and $x^{(1)}$), where the full double precision result is needed.
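One step of iterative improvement can be sketched for a 2×2 system as follows. The working precision here is Python's double, and exact rational arithmetic (`fractions.Fraction`) stands in for the "twice the working precision" residual. Both that substitution and the use of Cramer's rule as the solver are assumptions made to keep the example short.

```python
from fractions import Fraction

def solve2(A, b):
    """Solve a 2x2 system by Cramer's rule in ordinary floating point."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - b[0] * a21) / det]

def residual(A, x, b):
    """Step (12): xi = A x - b, accumulated exactly and rounded only once."""
    return [float(sum(Fraction(a) * Fraction(xj) for a, xj in zip(row, x))
                  - Fraction(bi))
            for row, bi in zip(A, b)]

def improve(A, b, x):
    xi = residual(A, x, b)                       # (12)
    y = solve2(A, xi)                            # (13)
    return [xj - yj for xj, yj in zip(x, y)]     # (14)

A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]                     # true solution is (0.8, 1.4)
x1 = [0.8 + 1e-6, 1.4 - 1e-6]      # a deliberately inaccurate first solution
x2 = improve(A, b, x1)
```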

To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set. Some of the implications of this for compilers are discussed in the next section.

### Languages and Compilers

The interaction of compilers and floating-point is discussed in Farnum [1988], and much of the discussion in this section is taken from that paper.

#### Ambiguity

Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. While this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point. Perhaps this is due to the fact that many language designers believe that nothing can be proven about floating-point, since it entails rounding error. If so, the previous sections have demonstrated the fallacy in this reasoning. This section discusses some common grey areas in language definitions, including suggestions about how to deal with them.

Remarkably enough, some languages don’t clearly specify that if `x` is a floating-point variable (with say a value of `3.0/10.0`), then every occurrence of (say) `10.0*x` must have the same value. For example Ada, which is based on Brown’s model, seems to imply that floating-point arithmetic only has to satisfy Brown’s axioms, and thus expressions can have one of many possible values. Thinking about floating-point in this fuzzy way stands in sharp contrast to the IEEE model, where the result of each floating-point operation is precisely defined. In the IEEE model, we can prove that `(3.0/10.0)*10.0` evaluates to `3` (Theorem 7). In Brown’s model, we cannot.

Another ambiguity in most language definitions concerns what happens on overflow, underflow and other exceptions. The IEEE standard precisely specifies the behavior of exceptions, and so languages that use the standard as a model can avoid any ambiguity on this point.

Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression `(x+y)+z` has a totally different answer than `x+(y+z)` when $x = 10^{30}$, $y = -10^{30}$ and $z = 1$ (it is 1 in the former case, 0 in the latter). The importance of preserving parentheses cannot be overemphasized. The algorithms presented in theorems 3, 4 and 6 all depend on it. For example, in Theorem 6, the formula $x_h = mx - (mx - x)$ would reduce to $x_h = x$ if it weren’t for parentheses, thereby destroying the entire algorithm. A language definition that does not require parentheses to be honored is useless for floating-point calculations.
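Both the parenthesization example and the earlier `(3.0/10.0)*10.0` example can be checked directly in any language with IEEE double precision arithmetic, such as Python:

```python
x, y, z = 1e30, -1e30, 1.0

left = (x + y) + z     # x + y is exactly 0, so the sum is 1
right = x + (y + z)    # y + z rounds back to y, so the sum is 0

# In IEEE arithmetic the result of each operation is precisely defined,
# and here dividing by 10 and multiplying by 10 recovers 3 exactly.
scaled = (3.0 / 10.0) * 10.0
```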

Subexpression evaluation is imprecisely defined in many languages. Suppose that `ds` is double precision, but `x` and `y` are single precision. Then in the expression `ds + x*y`, is the product performed in single or double precision? Another example: in `x + m/n` where `m` and `n` are integers, is the division an integer operation or a floating-point one? There are two ways to deal with this problem, neither of which is completely satisfactory. The first is to require that all variables in an expression have the same type. This is the simplest solution, but has some drawbacks. First of all, languages like Pascal that have subrange types allow mixing subrange variables with integer variables, so it is somewhat bizarre to prohibit mixing single and double precision variables. Another problem concerns constants. In the expression `0.1*x`, most languages interpret 0.1 to be a single precision constant. Now suppose the programmer decides to change the declaration of all the floating-point variables from single to double precision. If 0.1 is still treated as a single precision constant, then there will be a compile time error. The programmer will have to hunt down and change every floating-point constant.

The second approach is to allow mixed expressions, in which case rules for subexpression evaluation must be provided. There are a number of guiding examples. The original definition of C required that every floating-point expression be computed in double precision [Kernighan and Ritchie 1978]. This leads to anomalies like the example at the beginning of this section. The expression `3.0/7.0` is computed in double precision, but if `q` is a single-precision variable, the quotient is rounded to single precision for storage. Since 3/7 is a repeating binary fraction, its computed value in double precision is different from its stored value in single precision. Thus the comparison q = 3/7 fails. This suggests that computing every expression in the highest precision available is not a good rule.
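The anomaly can be reproduced in Python, whose floats are IEEE doubles, by rounding the quotient to single precision through the `struct` module. The helper `to_single` is illustrative.

```python
import struct

def to_single(x):
    """Round an IEEE double to the nearest IEEE single and convert it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

q = to_single(3.0 / 7.0)       # the quotient, rounded to single for storage
equal = (q == 3.0 / 7.0)       # compared against the double precision quotient

half = to_single(0.5)          # exactly representable, so storage changes nothing
```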

Another guiding example is inner products. If the inner product has thousands of terms, the rounding error in the sum can become substantial. One way to reduce this rounding error is to accumulate the sums in double precision (this will be discussed in more detail in the section Optimizers). If `d` is a double precision variable, and `x[]` and `y[]` are single precision arrays, then the inner product loop will look like `d = d + x[i]*y[i]`. If the multiplication is done in single precision, then much of the advantage of double precision accumulation is lost, because the product is truncated to single precision just before being added to a double precision variable.

A rule that covers both of the previous two examples is to compute an expression in the highest precision of any variable that occurs in that expression. Then `q = 3.0/7.0` will be computed entirely in single precision[24] and will have the boolean value true, whereas `d = d + x[i]*y[i]` will be computed in double precision, gaining the full advantage of double precision accumulation. However, this rule is too simplistic to cover all cases cleanly. If `dx` and `dy` are double precision variables, the expression `y = x + single(dx-dy)` contains a double precision variable, but performing the sum in double precision would be pointless, because both operands are single precision, as is the result.

A more sophisticated subexpression evaluation rule is as follows. First assign each operation a tentative precision, which is the maximum of the precisions of its operands. This assignment has to be carried out from the leaves to the root of the expression tree. Then perform a second pass from the root to the leaves. In this pass, assign to each operation the maximum of the tentative precision and the precision expected by the parent. In the case of `q = 3.0/7.0`, every leaf is single precision, so all the operations are done in single precision. In the case of `d = d + x[i]*y[i]`, the tentative precision of the multiply operation is single precision, but in the second pass it gets promoted to double precision, because its parent operation expects a double precision operand. And in `y = x + single(dx-dy)`, the addition is done in single precision. Farnum [1988] presents evidence that this algorithm is not difficult to implement.
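The two passes can be sketched on a small expression tree. The `Expr` class, the treatment of `single(...)` as a conversion node that imposes no expectation on its operand, and the modeling of an assignment as "the target's precision is what the root's parent expects" are choices made for this illustration; Farnum's paper may organize the algorithm differently.

```python
from dataclasses import dataclass, field

SINGLE, DOUBLE = 1, 2

@dataclass
class Expr:
    op: str                                  # 'var', 'single', or an operator
    kids: list = field(default_factory=list)
    prec: int = SINGLE                       # declared for leaves, computed otherwise

def tentative(e):
    """Pass 1, leaves to root: tentative precision is the max over the operands."""
    if e.op == "var":
        return e.prec
    kid_precs = [tentative(k) for k in e.kids]
    e.prec = SINGLE if e.op == "single" else max(kid_precs)
    return e.prec

def finalize(e, expected):
    """Pass 2, root to leaves: max of tentative precision and parent's expectation."""
    if e.op == "var":
        return
    if e.op == "single":
        finalize(e.kids[0], SINGLE)          # an explicit conversion stops promotion
        return
    e.prec = max(e.prec, expected)
    for k in e.kids:
        finalize(k, e.prec)

def assign(target_prec, rhs):
    tentative(rhs)
    finalize(rhs, target_prec)
    return rhs

def var(p):
    return Expr("var", prec=p)

div = assign(SINGLE, Expr("/", [var(SINGLE), var(SINGLE)]))      # q = 3.0/7.0
mul = Expr("*", [var(SINGLE), var(SINGLE)])                      # x[i]*y[i]
acc = assign(DOUBLE, Expr("+", [var(DOUBLE), mul]))              # d = d + x[i]*y[i]
sub = Expr("-", [var(DOUBLE), var(DOUBLE)])                      # dx-dy
add = assign(SINGLE, Expr("+", [var(SINGLE), Expr("single", [sub])]))
```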

The disadvantage of this rule is that the evaluation of a subexpression depends on the expression in which it is embedded. This can have some annoying consequences. For example, suppose you are debugging a program and want to know the value of a subexpression. You cannot simply type the subexpression to the debugger and ask it to be evaluated, because the value of the subexpression in the program depends on the expression it is embedded in. A final comment on subexpressions: since converting decimal constants to binary is an operation, the evaluation rule also affects the interpretation of decimal constants. This is especially important for constants like `0.1` which are not exactly representable in binary.

Another potential grey area occurs when a language includes exponentiation as one of its built-in operations. Unlike the basic arithmetic operations, the value of exponentiation is not always obvious [Kahan and Coonen 1982]. If `**` is the exponentiation operator, then `(-3)**3` certainly has the value -27. However, `(-3.0)**3.0` is problematical. If the `**` operator checks for integer powers, it would compute `(-3.0)**3.0` as $(-3.0)^3 = -27$. On the other hand, if the formula $x^y = e^{y \log x}$ is used to define `**` for real arguments, then depending on the log function, the result could be a NaN (using the natural definition of $\log(x) =$ `NaN` when $x < 0$). If the FORTRAN `CLOG` function is used however, then the answer will be -27, because the ANSI FORTRAN standard defines `CLOG(-3.0)` to be $i\pi + \log 3$ [ANSI 1978]. The programming language Ada avoids this problem by only defining exponentiation for integer powers, while ANSI FORTRAN prohibits raising a negative number to a real power.
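Python happens to illustrate all three possibilities. Its `**` operator on floats checks for integer-valued exponents, its real logarithm `math.log` rejects negative arguments (by raising an error rather than returning a NaN), and `cmath.log` returns the principal complex logarithm, which plays the role of `CLOG` here.

```python
import cmath
import math

integer_power = (-3.0) ** 3.0          # integer-valued exponent is recognized

try:
    math.log(-3.0)                     # real log of a negative number
    real_log_defined = True
except ValueError:
    real_log_defined = False

principal = cmath.log(-3.0)            # log 3 + i*pi
via_complex_log = cmath.exp(3.0 * principal)   # e^(y log x), close to -27
```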

In fact, the FORTRAN standard says that

Any arithmetic operation whose result is not mathematically defined is prohibited…

Unfortunately, with the introduction of $\pm\infty$ by the IEEE standard, the meaning of not mathematically defined is no longer totally clear cut. One definition might be to use the method shown in section Infinity. For example, to determine the value of $a^b$, consider non-constant analytic functions $f$ and $g$ with the property that $f(x) \to a$ and $g(x) \to b$ as $x \to 0$. If $f(x)^{g(x)}$ always approaches the same limit, then this should be the value of $a^b$. This definition would set $2^{\infty} = \infty$, which seems quite reasonable. In the case of $1.0^{\infty}$, when $f(x) = 1$ and $g(x) = 1/x$ the limit approaches 1, but when $f(x) = 1 - x$ and $g(x) = 1/x$ the limit is $e^{-1}$. So $1.0^{\infty}$ should be a NaN. In the case of $0^0$, $f(x)^{g(x)} = e^{g(x) \log f(x)}$. Since $f$ and $g$ are analytic and take on the value 0 at 0, $f(x) = a_1x^1 + a_2x^2 + \cdots$ and $g(x) = b_1x^1 + b_2x^2 + \cdots$. Thus $\lim_{x \to 0} g(x) \log f(x) = \lim_{x \to 0} x \log(x(a_1 + a_2x + \cdots)) = \lim_{x \to 0} x \log(a_1x) = 0$. So $f(x)^{g(x)} \to e^0 = 1$ for all $f$ and $g$, which means that $0^0 = 1$.[25] [26] Using this definition would unambiguously define the exponential function for all arguments, and in particular would define `(-3.0)**3.0` to be -27.

#### The IEEE Standard

The section The IEEE Standard discussed many of the features of the IEEE standard. However, the IEEE standard says nothing about how these features are to be accessed from a programming language. Thus, there is usually a mismatch between floating-point hardware that supports the standard and programming languages like C, Pascal or FORTRAN. Some of the IEEE capabilities can be accessed through a library of subroutine calls. For example the IEEE standard requires that square root be exactly rounded, and the square root function is often implemented directly in hardware. This functionality is easily accessed via a library square root routine. However, other aspects of the standard are not so easily implemented as subroutines. For example, most computer languages specify at most two floating-point types, while the IEEE standard has four different precisions (although the recommended configurations are single plus single-extended or single, double, and double-extended). Infinity provides another example. Constants to represent $\pm\infty$ could be supplied by a subroutine. But that might make them unusable in places that require constant expressions, such as the initializer of a constant variable.

A more subtle situation is manipulating the state associated with a computation, where the state consists of the rounding modes, trap enable bits, trap handlers and exception flags. One approach is to provide subroutines for reading and writing the state. In addition, a single call that can atomically set a new value and return the old value is often useful. As the examples in the section Flags show, a very common pattern of modifying IEEE state is to change it only within the scope of a block or subroutine. Thus the burden is on the programmer to find each exit from the block, and make sure the state is restored. Language support for setting the state precisely in the scope of a block would be very useful here. Modula-3 is one language that implements this idea for trap handlers [Nelson 1991].
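As a rough illustration of block-scoped state, the save-and-restore pattern from the Flags section can be wrapped in a Python context manager, so that every exit from the block (including an exception) restores the state. The sketch uses the `decimal` module's flags and trap enables as a stand-in for IEEE state, and `preserved_flag` is a hypothetical helper, not a library function.

```python
from contextlib import contextmanager
from decimal import Decimal, DivisionByZero, getcontext, localcontext

@contextmanager
def preserved_flag(signal):
    """Disable the trap for `signal` inside the block, then restore both the
    trap enable and the status flag to their values on entry."""
    ctx = getcontext()
    saved_flag, saved_trap = ctx.flags[signal], ctx.traps[signal]
    ctx.traps[signal] = False
    try:
        yield ctx
    finally:                      # runs on every exit path from the block
        ctx.flags[signal] = saved_flag
        ctx.traps[signal] = saved_trap

with localcontext() as outer:     # private context so the demo has no side effects
    outer.clear_flags()
    with preserved_flag(DivisionByZero):
        ratio = Decimal(1) / Decimal(0)      # an intermediate, harmless infinity
        inside = bool(getcontext().flags[DivisionByZero])
    after = bool(getcontext().flags[DivisionByZero])
    trap_restored = bool(getcontext().traps[DivisionByZero])
```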

There are a number of minor points that need to be considered when implementing the IEEE standard in a language. Since $x - x = +0$ for all $x$,[27] $(+0) - (+0) = +0$. However, $-(+0) = -0$, thus $-x$ should not be defined as $0 - x$. The introduction of NaNs can be confusing, because a NaN is never equal to any other number (including another NaN), so $x = x$ is no longer always true. In fact, the expression $x \ne x$ is the simplest way to test for a NaN if the IEEE recommended function `Isnan` is not provided. Furthermore, NaNs are unordered with respect to all other numbers, so $x \le y$ cannot be defined as not $x > y$. Since the introduction of NaNs causes floating-point numbers to become partially ordered, a `compare` function that returns one of <, =, >, or unordered can make it easier for the programmer to deal with comparisons.
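These points can be checked in Python, whose floats follow IEEE double precision. The `compare` and `is_nan` functions below are illustrative sketches of the functions described, not library routines.

```python
import math

def is_nan(x):
    """A NaN is the only value that is not equal to itself."""
    return x != x

def compare(x, y):
    """Four-way comparison for a partially ordered set."""
    if x < y:
        return "<"
    if x == y:
        return "="
    if x > y:
        return ">"
    return "unordered"

nan = math.nan
le_direct = nan <= 1.0            # False
le_via_not_gt = not (nan > 1.0)   # True, so the two definitions disagree

negated_zero = -(0.0)             # -0
zero_minus_zero = 0.0 - 0.0       # +0
```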

Although the IEEE standard defines the basic floating-point operations to return a NaN if any operand is a NaN, this might not always be the best definition for compound operations. For example when computing the appropriate scale factor to use in plotting a graph, the maximum of a set of values must be computed. In this case it makes sense for the max operation to simply ignore NaNs.
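A sketch of a maximum that ignores NaNs, alongside Python's built-in `max`, whose result in the presence of a NaN depends on the order of the arguments. The name `max_ignoring_nan` is illustrative.

```python
import math

def max_ignoring_nan(values):
    """Maximum of the non-NaN values; NaN only if nothing else is present."""
    kept = [v for v in values if not math.isnan(v)]
    return max(kept) if kept else math.nan

data = [1.0, math.nan, 2.0]
scale = max_ignoring_nan(data)

# The built-in max relies on ordered comparisons, all of which are false for NaN.
builtin_nan_first = max([math.nan, 1.0, 2.0])
builtin_nan_middle = max([1.0, math.nan, 2.0])
```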

Finally, rounding can be a problem. The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit `round` function in languages. This means that programs which wish to use IEEE rounding can’t use the natural language primitives, and conversely the language primitives will be inefficient to implement on the ever increasing number of IEEE machines.

#### Optimizers

Compiler texts tend to ignore the subject of floating-point. For example Aho et al. [1986] mentions replacing `x/2.0` with `x*0.5`, leading the reader to assume that `x/10.0` should be replaced by `0.1*x`. However, these two expressions do not have the same semantics on a binary machine, because 0.1 cannot be represented exactly in binary. This textbook also suggests replacing `x*y-x*z` by `x*(y-z)`, even though we have seen that these two expressions can have quite different values when $y \approx z$. Although it does qualify the statement that any algebraic identity can be used when optimizing code by noting that optimizers should not violate the language definition, it leaves the impression that floating-point semantics are not very important. Whether or not the language standard specifies that parentheses must be honored, `(x+y)+z` can have a totally different answer than `x+(y+z)`, as discussed above. There is a problem closely related to preserving parentheses that is illustrated by the following code:

```
eps = 1;
do eps = 0.5*eps; while (eps + 1 > 1);
```
This is designed to give an estimate for machine epsilon. If an optimizing compiler notices that $\mathit{eps} + 1 > 1 \Leftrightarrow \mathit{eps} > 0$, the program will be changed completely. Instead of computing the smallest number $x$ such that $1 \oplus x$ is still greater than 1 ($x \approx \epsilon \approx \beta^{-p}$), it will compute the largest number $x$ for which $x/2$ is rounded to 0 ($x \approx \beta^{e_{\min}}$). Avoiding this kind of “optimization” is so important that it is worth presenting one more very useful algorithm that is totally ruined by it.
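The difference between the two loop conditions can be observed in IEEE double precision. The helper `halve_until` is illustrative: it runs the loop with a given continuation test and returns both the last value for which the test held and the value at which it failed.

```python
def halve_until(keep_going):
    """Run `do eps = 0.5*eps; while (keep_going(eps))` starting from eps = 1."""
    eps = 1.0
    while True:
        previous = eps
        eps = 0.5 * eps
        if not keep_going(eps):
            return previous, eps

# The loop as written: stops once eps + 1 rounds to 1.
intended = halve_until(lambda eps: eps + 1.0 > 1.0)

# The loop after the invalid "optimization" eps + 1 > 1  <=>  eps > 0:
# it keeps halving through the denormals until eps/2 rounds to 0.
optimized = halve_until(lambda eps: eps > 0.0)
```

The first loop stops near $2^{-53}$, while the second runs all the way down to the smallest denormal, $2^{-1074}$, before halving produces zero.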

Many problems, such as numerical integration and the numerical solution of differential equations involve computing sums with many terms. Because each addition can potentially introduce an error as large as .5 ulp, a sum involving thousands of terms can have quite a bit of rounding error. A simple way to correct for this is to store the partial summand in a double precision variable and to perform each addition using double precision. If the calculation is being done in single precision, performing the sum in double precision is easy on most computer systems. However, if the calculation is already being done in double precision, doubling the precision is not so simple. One method that is sometimes advocated is to sort the numbers and add them from smallest to largest. However, there is a much more efficient method which dramatically improves the accuracy of sums, namely

#### Theorem 8 (Kahan Summation Formula)

Suppose that $\sum_{j=1}^{N} x_j$ is computed using the following algorithm

```
S = X[1];
C = 0;
for j = 2 to N {
    Y = X[j] - C;
    T = S + Y;
    C = (T - S) - Y;
    S = T;
}
```

Then the computed sum $S$ is equal to $\sum x_j (1 + \delta_j) + O(N\epsilon^2) \sum |x_j|$, where $|\delta_j| \le 2\epsilon$.

Using the naive formula $\sum x_j$, the computed sum is equal to $\sum x_j (1 + \delta_j)$ where $|\delta_j| < (n - j)\epsilon$. Comparing this with the error in the Kahan summation formula shows a dramatic improvement. Each summand is perturbed by only $2\epsilon$, instead of perturbations as large as $n\epsilon$ in the simple formula. Details are in the section Errors In Summation.

An optimizer that believed floating-point arithmetic obeyed the laws of algebra would conclude that $C = [T - S] - Y = [(S + Y) - S] - Y = 0$, rendering the algorithm completely useless. These examples can be summarized by saying that optimizers should be extremely cautious when applying algebraic identities that hold for the mathematical real numbers to expressions involving floating-point variables.
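The following Python sketch compares the Kahan summation formula with a plain running sum. The function names and the test data (100,000 copies of 0.1) are choices made for this illustration, and `math.fsum` from the standard library serves as the correctly rounded reference.

```python
import math

def naive_sum(xs):
    """Plain left-to-right summation."""
    s = 0.0
    for x in xs:
        s += x
    return s

def kahan_sum(xs):
    """Kahan summation: c carries the rounding error of the previous addition."""
    s = xs[0]
    c = 0.0
    for x in xs[1:]:
        y = x - c          # corrected summand
        t = s + y          # low-order bits of y may be lost here
        c = (t - s) - y    # recover (the negative of) what was lost
        s = t
    return s

data = [0.1] * 100_000
reference = math.fsum(data)                      # correctly rounded sum
naive_error = abs(naive_sum(data) - reference)
kahan_error = abs(kahan_sum(data) - reference)
print(naive_error, kahan_error)
```

Python does not reassociate floating-point expressions, so the line `c = (t - s) - y` survives as written; that is exactly the property the text asks optimizers to preserve.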

Another way that optimizers can change the semantics of floating-point code involves constants. In the expression `1.0E-40*x`, there is an implicit decimal to binary conversion operation that converts the decimal number to a binary constant. Because this constant cannot be represented exactly in binary, the inexact exception should be raised. In addition, the underflow flag should be set if the expression is evaluated in single precision. Since the constant is inexact, its exact conversion to binary depends on the current value of the IEEE rounding modes. Thus an optimizer that converts `1.0E-40` to binary at compile time would be changing the semantics of the program. However, constants like 27.5 which are exactly representable in the smallest available precision can be safely converted at compile time, since they are always exact, cannot raise any exception, and are unaffected by the rounding modes. Constants that are intended to be converted at compile time should be declared with a constant declaration, such as `const pi = 3.14159265`.
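A small Python check of the two constants, under the assumption that Python floats are IEEE doubles. The helper `to_single` is hypothetical and rounds a double to single precision through the standard `struct` module. Python exposes neither the rounding mode nor the exception flags for binary floats, so only exactness and the underflow range are shown.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 27.5 = 55/2 is exactly representable, even in single precision.
exact_27_5 = Fraction(27.5) == Fraction(55, 2)
single_keeps_27_5 = to_single(27.5) == 27.5

# 1.0E-40 is not a binary fraction, so its conversion is inexact ...
tiny_is_exact = Fraction(1e-40) == Fraction(1, 10**40)
# ... and in single precision it falls below the smallest normal number.
tiny_single = to_single(1e-40)
min_normal_single = 2.0 ** -126
```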

Common subexpression elimination is another example of an optimization that can change floating-point semantics, as illustrated by the following code:

```
C = A*B;
RndMode = Up
D = A*B;
```

Although `A*B` can appear to be a common subexpression, it is not because the rounding mode is different at the two evaluation sites. Three final examples: $x = x$ cannot be replaced by the boolean constant `true`, because it fails when $x$ is a NaN; $-x = 0 - x$ fails for $x = +0$; and $x < y$ is not the opposite of $x \ge y$, because NaNs are neither greater than nor less than ordinary floating-point numbers.
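These three cases can be observed directly in Python (assuming IEEE doubles). The variable names are illustrative only.

```python
import math

nan = float("nan")
x_equals_x = (nan == nan)      # False: x = x fails for a NaN
less = nan < 1.0               # False
greater_equal = nan >= 1.0     # False as well, so "<" is not the opposite of ">="

pos_zero = 0.0
neg_sign = math.copysign(1.0, -pos_zero)        # sign of -(+0)
sub_sign = math.copysign(1.0, 0.0 - pos_zero)   # sign of 0 - (+0)
```

Here `neg_sign` is negative while `sub_sign` is positive, which is why $-x$ cannot be rewritten as $0 - x$.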

Despite these examples, there are useful optimizations that can be done on floating-point code. First of all, there are algebraic identities that are valid for floating-point numbers. Some examples in IEEE arithmetic are x + y = y + x, 2 ×  x = x + x, 1 × x = x, and 0.5× x = x/2. However, even these simple identities can fail on a few machines such as CDC and Cray supercomputers. Instruction scheduling and in-line procedure substitution are two other potentially useful optimizations.28

As a final example, consider the expression `dx = x*y`, where `x` and `y` are single precision variables, and `dx` is double precision. On machines that have an instruction that multiplies two single precision numbers to produce a double precision number, `dx = x*y` can get mapped to that instruction, rather than compiled to a series of instructions that convert the operands to double and then perform a double to double precision multiply.
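One reason such an instruction is attractive is that the product of two 24-bit significands fits in the 53 bits of a double, so `dx` is the exact product. The sketch below checks this in Python; `to_single` is a hypothetical helper that rounds to single precision with `struct`, and the operand values are arbitrary.

```python
import struct
from fractions import Fraction

def to_single(v):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", v))[0]

x = to_single(1.0 / 3.0)
y = to_single(0.1)
dx = x * y                                   # double precision product

product_is_exact = Fraction(dx) == Fraction(x) * Fraction(y)
single_product = to_single(dx)               # what a single precision multiply keeps
```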

Some compiler writers view restrictions which prohibit converting $(x + y) + z$ to $x + (y + z)$ as irrelevant, of interest only to programmers who use unportable tricks. Perhaps they have in mind that floating-point numbers model real numbers and should obey the same laws that real numbers do. The problem with real number semantics is that they are extremely expensive to implement. Every time two $n$ bit numbers are multiplied, the product will have $2n$ bits. Every time two $n$ bit numbers with widely spaced exponents are added, the number of bits in the sum is $n$ + the space between the exponents. The sum could have up to $(e_{max} - e_{min}) + n$ bits, or roughly $2 \cdot e_{max} + n$ bits. An algorithm that involves thousands of operations (such as solving a linear system) will soon be operating on numbers with many significant bits, and be hopelessly slow. The implementation of library functions such as sin and cos is even more difficult, because the values of these transcendental functions aren’t rational numbers. Exact integer arithmetic is often provided by Lisp systems and is handy for some problems. However, exact floating-point arithmetic is rarely useful.

The fact is that there are useful algorithms (like the Kahan summation formula) that exploit the fact that $(x + y) + z \ne x + (y + z)$, and work whenever the bound

$$a \oplus b = (a + b)(1 + \delta)$$

holds (as well as similar bounds for $-$, $\times$ and $/$). Since these bounds hold for almost all commercial hardware, it would be foolish for numerical programmers to ignore such algorithms, and it would be irresponsible for compiler writers to destroy these algorithms by pretending that floating-point variables have real number semantics.
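Both facts can be checked with exact rational arithmetic. This Python sketch assumes IEEE doubles with round to nearest, for which the relative error of one addition is at most $2^{-53}$; the helper name is hypothetical.

```python
from fractions import Fraction

# Addition is not associative in floating point.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

def relative_error_of_sum(a, b):
    """Exact |delta| in  a (+) b = (a + b)(1 + delta), using rationals."""
    exact = Fraction(a) + Fraction(b)
    return abs(Fraction(a + b) - exact) / exact

epsilon = Fraction(1, 2**53)
delta = relative_error_of_sum(0.1, 0.2)
print(left, right, float(delta / epsilon))
```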

### Exception Handling

The topics discussed up to now have primarily concerned systems implications of accuracy and precision. Trap handlers also raise some interesting systems issues. The IEEE standard strongly recommends that users be able to specify a trap handler for each of the five classes of exceptions, and the section Trap Handlers, gave some applications of user defined trap handlers. In the case of invalid operation and division by zero exceptions, the handler should be provided with the operands, otherwise, with the exactly rounded result. Depending on the programming language being used, the trap handler might be able to access other variables in the program as well. For all exceptions, the trap handler must be able to identify what operation was being performed and the precision of its destination.

The IEEE standard assumes that operations are conceptually serial and that when an interrupt occurs, it is possible to identify the operation and its operands. On machines which have pipelining or multiple arithmetic units, when an exception occurs, it may not be enough to simply have the trap handler examine the program counter. Hardware support for identifying exactly which operation trapped may be necessary.

Another problem is illustrated by the following program fragment.

```
x = y*z;
z = x*w;
a = b + c;
d = a/x;
```

Suppose the second multiply raises an exception, and the trap handler wants to use the value of `a`. On hardware that can do an add and multiply in parallel, an optimizer would probably move the addition operation ahead of the second multiply, so that the add can proceed in parallel with the first multiply. Thus when the second multiply traps, `a = b + c` has already been executed, potentially changing the result of `a`. It would not be reasonable for a compiler to avoid this kind of optimization, because every floating-point operation can potentially trap, and thus virtually all instruction scheduling optimizations would be eliminated. This problem can be avoided by prohibiting trap handlers from accessing any variables of the program directly. Instead, the handler can be given the operands or result as an argument.

But there are still problems. In the fragment

```
x = y*z;
z = a + b;
```

the two instructions might well be executed in parallel. If the multiply traps, its argument `z` could already have been overwritten by the addition, especially since addition is usually faster than multiply. Computer systems that support the IEEE standard must provide some way to save the value of `z`, either in hardware or by having the compiler avoid such a situation in the first place.

W. Kahan has proposed using presubstitution instead of trap handlers to avoid these problems. In this method, the user specifies an exception and the value he wants to be used as the result when the exception occurs. As an example, suppose that in code for computing (sin x)/x, the user decides that x = 0 is so rare that it would improve performance to avoid a test for x = 0, and instead handle this case when a 0/0 trap occurs. Using IEEE trap handlers, the user would write a handler that returns a value of 1 and install it before computing sin x/x. Using presubstitution, the user would specify that when an invalid operation occurs, the value 1 should be used. Kahan calls this presubstitution, because the value to be used must be specified before the exception occurs. When using trap handlers, the value to be returned can be computed when the trap occurs.

The advantage of presubstitution is that it has a straightforward hardware implementation.29 As soon as the type of exception has been determined, it can be used to index a table which contains the desired result of the operation. Although presubstitution has some attractive attributes, the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers.

## The Details

A number of claims have been made in this paper concerning properties of floating-point arithmetic. We now proceed to show that floating-point is not black magic, but rather is a straightforward subject whose claims can be verified mathematically. This section is divided into three parts. The first part presents an introduction to error analysis, and provides the details for the section Rounding Error. The second part explores binary to decimal conversion, filling in some gaps from the section The IEEE Standard. The third part discusses the Kahan summation formula, which was used as an example in the section Systems Aspects.

### Rounding Error

In the discussion of rounding error, it was stated that a single guard digit is enough to guarantee that addition and subtraction will always be accurate (Theorem 2). We now proceed to verify this fact. Theorem 2 has two parts, one for subtraction and one for addition. The part for subtraction is

#### Theorem 9

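If $x$ and $y$ are positive floating-point numbers in a format with parameters $\beta$ and $p$, and if subtraction is done with $p + 1$ digits (i.e. one guard digit), then the relative rounding error in the result is less than

$$\left(\frac{\beta}{2} + 1\right)\beta^{-p} = \left(1 + \frac{2}{\beta}\right)\epsilon \le 2\epsilon.$$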

#### Proof

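Interchange $x$ and $y$ if necessary so that $x > y$. It is also harmless to scale $x$ and $y$ so that $x$ is represented by $x_0.x_1 \ldots x_{p-1} \times \beta^0$. If $y$ is represented as $y_0.y_1 \ldots y_{p-1}$, then the difference is exact. If $y$ is represented as $0.y_1 \ldots y_p$, then the guard digit ensures that the computed difference will be the exact difference rounded to a floating-point number, so the rounding error is at most $\epsilon$. In general, let $y = 0.0 \ldots 0 y_{k+1} \ldots y_{k+p}$ and $\bar{y}$ be $y$ truncated to $p + 1$ digits. Then

$$(15) \quad y - \bar{y} < (\beta - 1)(\beta^{-p-1} + \beta^{-p-2} + \cdots + \beta^{-p-k}).$$

From the definition of guard digit, the computed value of $x - y$ is $x - \bar{y}$ rounded to be a floating-point number, that is, $(x - \bar{y}) + \delta$, where the rounding error $\delta$ satisfies

$$(16) \quad |\delta| \le (\beta/2)\beta^{-p}.$$

The exact difference is $x - y$, so the error is $(x - y) - (x - \bar{y} + \delta) = \bar{y} - y - \delta$. There are three cases. If $x - y \ge 1$ then the relative error is bounded by

$$(17) \quad \frac{y - \bar{y} + |\delta|}{1} \le \beta^{-p}\left[(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k}) + \beta/2\right] < \beta^{-p}(1 + \beta/2).$$

Secondly, if $x - \bar{y} < 1$, then $\delta = 0$. Since the smallest that $x - y$ can be is

$$1.0 - 0.\overbrace{0 \ldots 0}^{k}\overbrace{\rho \ldots \rho}^{p} > (\beta - 1)(\beta^{-1} + \cdots + \beta^{-k}), \quad \text{where } \rho = \beta - 1,$$

in this case the relative error is bounded by

$$(18) \quad \frac{y - \bar{y} + |\delta|}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} < \frac{(\beta - 1)\beta^{-p}(\beta^{-1} + \cdots + \beta^{-k})}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} = \beta^{-p}.$$

The final case is when $x - y < 1$ but $x - \bar{y} \ge 1$. The only way this could happen is if $x - \bar{y} = 1$, in which case $\delta = 0$. But if $\delta = 0$, then (18) applies, so that again the relative error is bounded by $\beta^{-p} < \beta^{-p}(1 + \beta/2)$. ∎

When $\beta = 2$, the bound is exactly $2\epsilon$, and this bound is achieved for $x = 1 + 2^{2-p}$ and $y = 2^{1-p} - 2^{1-2p}$ in the limit as $p \to \infty$. When adding numbers of the same sign, a guard digit is not necessary to achieve good accuracy, as the following result shows.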
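The extremal example can be replayed with a small simulation. The sketch below is only an illustration under stated assumptions: it models binary arithmetic ($\beta = 2$) with exact rationals, rounds ties to even, and takes $x$ already scaled into $[1, 2)$. The function names are hypothetical.

```python
from fractions import Fraction

def round_to_p_bits(v, p):
    """Round a positive Fraction to p significant bits (ties to even)."""
    e = 0
    while Fraction(2) ** (e + 1) <= v:
        e += 1
    while Fraction(2) ** e > v:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    return round(v / ulp) * ulp      # round() on a Fraction rounds half to even

def subtract_with_guard_digit(x, y, p):
    """x - y for 1 <= x < 2 and 0 < y < x, keeping only p + 1 bits of y."""
    guard_unit = Fraction(1, 2**p)               # weight of the guard bit
    y_truncated = (y // guard_unit) * guard_unit
    return round_to_p_bits(x - y_truncated, p)

p = 24
epsilon = Fraction(1, 2**p)                      # (beta/2) * beta**-p with beta = 2
x = 1 + Fraction(2) ** (2 - p)
y = Fraction(2) ** (1 - p) - Fraction(2) ** (1 - 2 * p)
computed = subtract_with_guard_digit(x, y, p)
relative_error = abs(computed - (x - y)) / (x - y)
ratio = float(relative_error / epsilon)          # approaches 2 as p grows
print(ratio)
```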

#### Theorem 10

If $x \ge 0$ and $y \ge 0$, then the relative error in computing $x + y$ is at most $2\epsilon$, even if no guard digits are used.

#### Proof

The algorithm for addition with $k$ guard digits is similar to that for subtraction. If $x \ge y$, shift $y$ right until the radix points of $x$ and $y$ are aligned. Discard any digits shifted past the $p + k$ position. Compute the sum of these two $p + k$ digit numbers exactly. Then round to $p$ digits.

We will verify the theorem when no guard digits are used; the general case is similar. There is no loss of generality in assuming that $x \ge y \ge 0$ and that $x$ is scaled to be of the form $d.dd \ldots d \times \beta^0$. First, assume there is no carry out. Then the digits shifted off the end of $y$ have a value less than $\beta^{-p+1}$, and the sum is at least 1, so the relative error is less than $\beta^{-p+1}/1 = 2\epsilon$. If there is a carry out, then the error from shifting must be added to the rounding error of

$$\tfrac{1}{2}\beta^{-p+2}.$$

The sum is at least $\beta$, so the relative error is less than

$$\left(\beta^{-p+1} + \tfrac{1}{2}\beta^{-p+2}\right)/\beta = (1 + \beta/2)\beta^{-p} \le 2\epsilon. \quad ∎$$

It is obvious that combining these two theorems gives Theorem 2. Theorem 2 gives the relative error for performing one operation. Comparing the rounding error of $x^2 - y^2$ and $(x + y)(x - y)$ requires knowing the relative error of multiple operations. The relative error of $x \ominus y$ is $\delta_1 = [(x \ominus y) - (x - y)] / (x - y)$, which satisfies $|\delta_1| \le 2\epsilon$. Or to write it another way

$$(19) \quad x \ominus y = (x - y)(1 + \delta_1), \quad |\delta_1| \le 2\epsilon$$

Similarly

$$(20) \quad x \oplus y = (x + y)(1 + \delta_2), \quad |\delta_2| \le 2\epsilon$$

Assuming that multiplication is performed by computing the exact product and then rounding, the relative error is at most .5 ulp, so

$$(21) \quad u \otimes v = uv(1 + \delta_3), \quad |\delta_3| \le \epsilon$$

for any floating-point numbers $u$ and $v$. Putting these three equations together (letting $u = x \ominus y$ and $v = x \oplus y$) gives

$$(22) \quad (x \ominus y) \otimes (x \oplus y) = (x - y)(1 + \delta_1)(x + y)(1 + \delta_2)(1 + \delta_3)$$

So the relative error incurred when computing $(x - y)(x + y)$ is

$$(23) \quad \frac{(x \ominus y) \otimes (x \oplus y) - (x^2 - y^2)}{x^2 - y^2} = (1 + \delta_1)(1 + \delta_2)(1 + \delta_3) - 1$$

This relative error is equal to $\delta_1 + \delta_2 + \delta_3 + \delta_1\delta_2 + \delta_1\delta_3 + \delta_2\delta_3 + \delta_1\delta_2\delta_3$, which is bounded by $5\epsilon + 8\epsilon^2$. In other words, the maximum relative error is about 5 rounding errors (since $\epsilon$ is a small number, $\epsilon^2$ is almost negligible).

A similar analysis of $(x \otimes x) \ominus (y \otimes y)$ cannot result in a small value for the relative error, because when two nearby values of $x$ and $y$ are plugged into $x^2 - y^2$, the relative error will usually be quite large. Another way to see this is to try and duplicate the analysis that worked on $(x \ominus y) \otimes (x \oplus y)$, yielding

$$\begin{aligned}
(x \otimes x) \ominus (y \otimes y) &= [x^2(1 + \delta_1) - y^2(1 + \delta_2)](1 + \delta_3) \\
&= ((x^2 - y^2)(1 + \delta_1) + (\delta_1 - \delta_2)y^2)(1 + \delta_3)
\end{aligned}$$

When $x$ and $y$ are nearby, the error term $(\delta_1 - \delta_2)y^2$ can be as large as the result $x^2 - y^2$. These computations formally justify our claim that $(x - y)(x + y)$ is more accurate than $x^2 - y^2$.
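A concrete instance in IEEE double precision, sketched in Python. The values of `x` and `y` are chosen for this illustration so that every intermediate result can be followed by hand: both squares lose a term of order $2^{-60}$ to rounding, while every operation in the factored form happens to be exact.

```python
from fractions import Fraction

x = 1.0 + 3 * 2.0**-30
y = 1.0 + 2.0**-30
exact = Fraction(x) ** 2 - Fraction(y) ** 2      # 2**-28 + 2**-57

def relative_error(computed):
    """Relative error of a computed value against the exact x^2 - y^2."""
    return abs(Fraction(computed) - exact) / exact

error_squares = relative_error(x * x - y * y)
error_factored = relative_error((x - y) * (x + y))
print(float(error_squares), float(error_factored))
```

The error of the squared form, about $2^{-29}$, is some seven orders of magnitude larger than the machine epsilon of $2^{-53}$.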

We next turn to an analysis of the formula for the area of a triangle. In order to estimate the maximum error that can occur when computing with (7), the following fact will be needed.

#### Theorem 11

If subtraction is performed with a guard digit, and $y/2 \le x \le 2y$, then $x - y$ is computed exactly.

#### Proof

Note that if $x$ and $y$ have the same exponent, then certainly $x \ominus y$ is exact. Otherwise, from the condition of the theorem, the exponents can differ by at most 1. Scale and interchange $x$ and $y$ if necessary so that $0 \le y \le x$, and $x$ is represented as $x_0.x_1 \ldots x_{p-1}$ and $y$ as $0.y_1 \ldots y_p$. Then the algorithm for computing $x \ominus y$ will compute $x - y$ exactly and round to a floating-point number. If the difference is of the form $0.d_1 \ldots d_p$, the difference will already be $p$ digits long, and no rounding is necessary. Since $x \le 2y$, $x - y \le y$, and since $y$ is of the form $0.d_1 \ldots d_p$, so is $x - y$. ∎

When $\beta > 2$, the hypothesis of Theorem 11 cannot be replaced by $y/\beta \le x \le \beta y$; the stronger condition $y/2 \le x \le 2y$ is still necessary. The analysis of the error in $(x - y)(x + y)$, immediately following the proof of Theorem 10, used the fact that the relative error in the basic operations of addition and subtraction is small (namely equations (19) and (20)). This is the most common kind of error analysis. However, analyzing formula (7) requires something more, namely Theorem 11, as the following proof will show.
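Theorem 11 is easy to probe empirically in IEEE double precision. The Python sketch below draws random pairs with $y/2 \le x \le 2y$ and compares the floating-point difference with the exact rational difference; the sample range, seed, and names are arbitrary choices for this illustration.

```python
import random
from fractions import Fraction

random.seed(0)

def subtraction_is_exact(x, y):
    """True when the floating-point x - y equals the exact rational difference."""
    return Fraction(x - y) == Fraction(x) - Fraction(y)

all_exact = True
for _ in range(10_000):
    y = random.uniform(1.0, 1000.0)
    x = random.uniform(y / 2, 2 * y)
    x = min(max(x, y / 2), 2 * y)      # guard against rounding past the ends
    all_exact = all_exact and subtraction_is_exact(x, y)

# Outside the hypothesis the difference is generally rounded.
far_apart_exact = subtraction_is_exact(1.0, 2.0**-60)
```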

#### Theorem 12

If subtraction uses a guard digit, and if $a$, $b$ and $c$ are the sides of a triangle ($a \ge b \ge c$), then the relative error in computing $(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))$ is at most $16\epsilon$, provided $\epsilon < .005$.

#### Proof

Let’s examine the factors one by one. From Theorem 10, $b \oplus c = (b + c)(1 + \delta_1)$, where $\delta_1$ is the relative error, and $|\delta_1| \le 2\epsilon$. Then the value of the first factor is

$$(a \oplus (b \oplus c)) = (a + (b \oplus c))(1 + \delta_2) = (a + (b + c)(1 + \delta_1))(1 + \delta_2),$$

and thus

$$\begin{aligned}
(a + b + c)(1 - 2\epsilon)^2 &\le [a + (b + c)(1 - 2\epsilon)](1 - 2\epsilon) \\
&\le a \oplus (b \oplus c) \\
&\le [a + (b + c)(1 + 2\epsilon)](1 + 2\epsilon) \\
&\le (a + b + c)(1 + 2\epsilon)^2
\end{aligned}$$

This means that there is an $\eta_1$ so that

$$(24) \quad (a \oplus (b \oplus c)) = (a + b + c)(1 + \eta_1)^2, \quad |\eta_1| \le 2\epsilon.$$

The next term involves the potentially catastrophic subtraction of $c$ and $a \ominus b$, because $a \ominus b$ may have rounding error. Because $a$, $b$ and $c$ are the sides of a triangle, $a \le b + c$, and combining this with the ordering $c \le b \le a$ gives $a \le b + c \le 2b \le 2a$. So $a - b$ satisfies the conditions of Theorem 11. This means that $a - b = a \ominus b$ is exact, hence $c \ominus (a - b)$ is a harmless subtraction which can be estimated from Theorem 9 to be

$$(25) \quad (c \ominus (a \ominus b)) = (c - (a - b))(1 + \eta_2), \quad |\eta_2| \le 2\epsilon$$

The third term is the sum of two exact positive quantities, so

$$(26) \quad (c \oplus (a \ominus b)) = (c + (a - b))(1 + \eta_3), \quad |\eta_3| \le 2\epsilon$$

Finally, the last term is

$$(27) \quad (a \oplus (b \ominus c)) = (a + (b - c))(1 + \eta_4)^2, \quad |\eta_4| \le 2\epsilon,$$

using both Theorem 9 and Theorem 10. If multiplication is assumed to be exactly rounded, so that $x \otimes y = xy(1 + \zeta)$ with $|\zeta| \le \epsilon$, then combining (24), (25), (26) and (27) gives

$$(a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c)) = (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))\,E$$

where

$$E = (1 + \eta_1)^2 (1 + \eta_2)(1 + \eta_3)(1 + \eta_4)^2 (1 + \zeta_1)(1 + \zeta_2)(1 + \zeta_3)$$

An upper bound for $E$ is $(1 + 2\epsilon)^6(1 + \epsilon)^3$, which expands out to $1 + 15\epsilon + O(\epsilon^2)$. Some writers simply ignore the $O(\epsilon^2)$ term, but it is easy to account for it. Writing $(1 + 2\epsilon)^6(1 + \epsilon)^3 = 1 + 15\epsilon + \epsilon R(\epsilon)$, $R(\epsilon)$ is a polynomial in $\epsilon$ with positive coefficients, so it is an increasing function of $\epsilon$. Since $R(.005) = .505$, $R(\epsilon) < 1$ for all $\epsilon < .005$, and hence $E \le (1 + 2\epsilon)^6(1 + \epsilon)^3 < 1 + 16\epsilon$. To get a lower bound on $E$, note that $1 - 15\epsilon - \epsilon R(\epsilon) < E$, and so when $\epsilon < .005$, $1 - 16\epsilon < (1 - 2\epsilon)^6(1 - \epsilon)^3$. Combining these two bounds yields $1 - 16\epsilon < E < 1 + 16\epsilon$. Thus the relative error is at most $16\epsilon$. ∎
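The value of $R(.005)$ can be checked with exact rational arithmetic. In this Python sketch, `R` is written on the assumption that it is defined by $(1 + 2\epsilon)^6(1 + \epsilon)^3 = 1 + 15\epsilon + \epsilon R(\epsilon)$, the reading under which the bound $1 + 16\epsilon$ follows from $R(\epsilon) < 1$.

```python
from fractions import Fraction

def R(eps):
    """R(eps) from (1 + 2 eps)^6 (1 + eps)^3 = 1 + 15 eps + eps * R(eps)."""
    bound = (1 + 2 * eps) ** 6 * (1 + eps) ** 3
    return (bound - 1 - 15 * eps) / eps

eps = Fraction(5, 1000)
r_value = float(R(eps))                                   # close to .505
upper = float((1 + 2 * eps) ** 6 * (1 + eps) ** 3)        # below 1 + 16 eps = 1.08
print(r_value, upper)
```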

Theorem 12 certainly shows that there is no catastrophic cancellation in formula (7). So although it is not necessary to show formula (7) is numerically stable, it is satisfying to have a bound for the entire formula, which is what Theorem 3 of Cancellation gives.

#### Proof of Theorem 3

Let

q = (a + (b + c)) (c – (a – b)) (c + (a – b)) (a + (b – c))

and

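$$Q = (a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c)).$$

Then, Theorem 12 shows that $Q = q(1 + \delta)$, with $|\delta| \le 16\epsilon$. It is easy to check that

$$(28) \quad 1 - 0.52|\delta| \le \sqrt{1 - |\delta|} \le \sqrt{1 + |\delta|} \le 1 + 0.52|\delta|$$

provided $|\delta| \le .04/(.52)^2 \approx .15$, and since $|\delta| \le 16\epsilon \le 16(.005) = .08$, $\delta$ does satisfy the condition. Thus

$$\sqrt{Q} = \sqrt{q(1 + \delta)} = \sqrt{q}\,(1 + \delta_1),$$

with $|\delta_1| \le .52|\delta| \le 8.5\epsilon$. If square roots are computed to within .5 ulp, then the error when computing $\sqrt{Q}$ is $(1 + \delta_1)(1 + \delta_2)$, with $|\delta_2| \le \epsilon$. If $\beta = 2$, then there is no further error committed when dividing by 4. Otherwise, one more factor $1 + \delta_3$ with $|\delta_3| \le \epsilon$ is necessary for the division, and using the method in the proof of Theorem 12, the final error bound of $(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)$ is dominated by $1 + \delta_4$, with $|\delta_4| \le 11\epsilon$. ∎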
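The bound of Theorem 12 on the product under the square root can be observed for a needle-like triangle. The Python sketch below assumes IEEE doubles (so $\epsilon = 2^{-53}$); the side lengths and function names are illustrative choices, and the exact product is evaluated with rationals.

```python
from fractions import Fraction

def triangle_product(a, b, c):
    """Floating-point (a+(b+c))(c-(a-b))(c+(a-b))(a+(b-c)), with a >= b >= c."""
    return (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))

def exact_triangle_product(a, b, c):
    """The same product evaluated exactly on the stored input values."""
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    return (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))

a, b, c = 100000.0, 99999.99979, 0.00029282
q = exact_triangle_product(a, b, c)
Q = triangle_product(a, b, c)
relative_error = abs(Fraction(Q) - q) / q
epsilon = Fraction(1, 2**53)
print(float(relative_error / epsilon))   # number of epsilons of error
```

The area itself would then be `math.sqrt(Q) / 4`.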

To make the heuristic explanation immediately following the statement of Theorem 4 precise, the next theorem describes just how closely µ(x) approximates a constant.

#### Theorem 13

If $\mu(x) = \ln(1 + x)/x$, then for $0 \le x \le 3/4$, $1/2 \le \mu(x) \le 1$ and the derivative satisfies $|\mu'(x)| \le 1/2$.

#### Proof

Note that $\mu(x) = 1 - x/2 + x^2/3 - \cdots$ is an alternating series with decreasing terms, so for $x \le 1$, $\mu(x) \ge 1 - x/2 \ge 1/2$. It is even easier to see that because the series for $\mu$ is alternating, $\mu(x) \le 1$. The Taylor series of $\mu'(x)$ is also alternating, and if $x \le 3/4$ has decreasing terms, so $-1/2 \le \mu'(x) \le -1/2 + 2x/3$, or $-1/2 \le \mu'(x) \le 0$, thus $|\mu'(x)| \le 1/2$. ∎

#### Proof of Theorem 4

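Since the Taylor series for ln,

$$\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots,$$

is an alternating series, $0 < x - \ln(1 + x) < x^2/2$, the relative error incurred when approximating $\ln(1 + x)$ by $x$ is bounded by $x/2$. If $1 \oplus x = 1$, then $|x| < \epsilon$, so the relative error is bounded by $\epsilon/2$.

When $1 \oplus x \ne 1$, define $\hat{x}$ via $1 \oplus x = 1 + \hat{x}$. Then since $0 \le x < 1$, $(1 \oplus x) \ominus 1 = \hat{x}$. If division and logarithms are computed to within 1/2 ulp, then the computed value of the expression $\ln(1 + x)/((1 + x) - 1)$ is

$$(29) \quad \frac{\ln(1 \oplus x)}{(1 \oplus x) \ominus 1}(1 + \delta_1)(1 + \delta_2) = \frac{\ln(1 + \hat{x})}{\hat{x}}(1 + \delta_1)(1 + \delta_2) = \mu(\hat{x})(1 + \delta_1)(1 + \delta_2)$$

where $|\delta_1| \le \epsilon$ and $|\delta_2| \le \epsilon$. To estimate $\mu(\hat{x})$, use the mean value theorem, which says that

$$(30) \quad \mu(\hat{x}) - \mu(x) = (\hat{x} - x)\mu'(\xi)$$

for some $\xi$ between $x$ and $\hat{x}$. From the definition of $\hat{x}$, it follows that $|\hat{x} - x| \le \epsilon$, and combining this with Theorem 13 gives $|\mu(\hat{x}) - \mu(x)| \le \epsilon/2$, or $|\mu(\hat{x})/\mu(x) - 1| \le \epsilon/(2|\mu(x)|) \le \epsilon$, which means that $\mu(\hat{x}) = \mu(x)(1 + \delta_3)$, with $|\delta_3| \le \epsilon$. Finally, multiplying by $x$ introduces a final $\delta_4$, so the computed value of

$$x \cdot \ln(1 \oplus x)/((1 \oplus x) \ominus 1)$$

is

$$\frac{x \ln(1 + x)}{(1 + x) - 1}(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)(1 + \delta_4), \quad |\delta_i| \le \epsilon.$$

It is easy to check that if $\epsilon < 0.1$, then

$$(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)(1 + \delta_4) = 1 + \delta,$$

with $|\delta| \le 5\epsilon$. ∎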
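The formula analyzed in this proof is short enough to try directly. In the Python sketch below, `log1p_accurate` is a hypothetical name, the library function `math.log1p` is used only as a reference value, and the thresholds printed are observations for one input rather than guarantees, since the accuracy of `math.log` depends on the platform's math library.

```python
import math

def log1p_accurate(x):
    """ln(1 + x) computed as x * ln(1 + x) / ((1 + x) - 1), for 0 <= x < 3/4."""
    w = 1.0 + x
    if w == 1.0:               # 1 (+) x = 1: x itself is the best answer
        return x
    return x * (math.log(w) / (w - 1.0))

x = 1e-10
reference = math.log1p(x)
naive_error = abs(math.log(1.0 + x) - reference) / reference
accurate_error = abs(log1p_accurate(x) - reference) / reference
print(naive_error, accurate_error)
```

The naive `math.log(1.0 + x)` loses about half of its digits here because `1.0 + x` has already rounded away the low-order part of `x`.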

An interesting example of error analysis using formulas (19), (20), and (21) occurs in the quadratic formula $(-b \pm \sqrt{b^2 - 4ac})/2a$. The section Cancellation explained how rewriting the equation will eliminate the potential cancellation caused by the ± operation. But there is another potential cancellation that can occur when computing $d = b^2 - 4ac$. This one cannot be eliminated by a simple rearrangement of the formula. Roughly speaking, when $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. Here is an informal proof (another approach to estimating the error in the quadratic formula appears in Kahan [1972]).

If $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula $(-b \pm \sqrt{b^2 - 4ac})/2a$.

Proof: Write $(b \otimes b) \ominus (4a \otimes c) = (b^2(1 + \delta_1) - 4ac(1 + \delta_2))(1 + \delta_3)$, where $|\delta_i| \le \epsilon$.30 Using $d = b^2 - 4ac$, this can be rewritten as $(d(1 + \delta_1) - 4ac(\delta_2 - \delta_1))(1 + \delta_3)$. To get an estimate for the size of this error, ignore second order terms in $\delta_i$, in which case the absolute error is $d(\delta_1 + \delta_3) - 4ac\delta_4$, where $|\delta_4| = |\delta_1 - \delta_2| \le 2\epsilon$. Since $d \ll 4ac$, the first term $d(\delta_1 + \delta_3)$ can be ignored. To estimate the second term, use the fact that $ax^2 + bx + c = a(x - r_1)(x - r_2)$, so $ar_1r_2 = c$. Since $b^2 \approx 4ac$, then $r_1 \approx r_2$, so the second error term is $4ac\delta_4 \approx 4a^2r_1^2\delta_4$. Thus the computed value of $\sqrt{d}$ is

$$\sqrt{d + 4a^2r_1^2\delta_4}.$$

The inequality

$$p - q \le \sqrt{p^2 - q^2} \le \sqrt{p^2 + q^2} \le p + q, \quad p \ge q > 0$$

shows that

$$\sqrt{d + 4a^2r_1^2\delta_4} = \sqrt{d} + E,$$

where

$$|E| \le \sqrt{4a^2r_1^2|\delta_4|},$$

so the absolute error in $\sqrt{d}/2a$ is about $r_1\sqrt{\delta_4}$. Since $\delta_4 \approx \beta^{-p}$, $\sqrt{\delta_4} \approx \beta^{-p/2}$, and thus the absolute error of $r_1\sqrt{\delta_4}$ destroys the bottom half of the bits of the roots $r_1 \approx r_2$. In other words, since the calculation of the roots involves computing with $\sqrt{d}/2a$, and this expression does not have meaningful bits in the position corresponding to the lower order half of $r_i$, then the lower order bits of $r_i$ cannot be meaningful. ∎
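A worked instance in IEEE double precision, sketched in Python. The coefficients are chosen for this illustration so that the quadratic factors as $(x + 1)(x + 1 + 2^{-29})$ with both coefficients exactly representable; the true discriminant is $2^{-58}$, but $b \otimes b$ rounds that term away.

```python
import math
from fractions import Fraction

a = 1.0
b = 2.0 + 2.0**-29
c = 1.0 + 2.0**-29      # a*x^2 + b*x + c = (x + 1)(x + 1 + 2**-29)

d_computed = b * b - 4.0 * a * c
d_exact = Fraction(b) ** 2 - 4 * Fraction(a) * Fraction(c)

root_computed = (-b + math.sqrt(d_computed)) / (2.0 * a)
root_exact = -1.0
relative_error = abs(root_computed - root_exact) / abs(root_exact)
print(d_computed, float(d_exact), relative_error)
```

The computed root is off by $2^{-30}$ in a format with 53 significant bits, so roughly the lower half of its bits are wrong.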

Finally, we turn to the proof of Theorem 6. It is based on the following fact, which is proven in the section Theorem 14 and Theorem 8.

#### Theorem 14

Let $0 < k < p$, and set $m = \beta^k + 1$, and assume that floating-point operations are exactly rounded. Then $(m \otimes x) \ominus (m \otimes x \ominus x)$ is exactly equal to $x$ rounded to $p - k$ significant digits. More precisely, $x$ is rounded by taking the significand of $x$, imagining a radix point just left of the $k$ least significant digits and rounding to an integer.

#### Proof of Theorem 6

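By Theorem 14, $x_h$ is $x$ rounded to $p - k = \lfloor p/2 \rfloor$ places. If there is no carry out, then certainly $x_h$ can be represented with $\lfloor p/2 \rfloor$ significant digits. Suppose there is a carry-out. If $x = x_0.x_1 \ldots x_{p-1} \times \beta^e$, then rounding adds 1 to $x_{p-k-1}$, and the only way there can be a carry-out is if $x_{p-k-1} = \beta - 1$, but then the low order digit of $x_h$ is $1 + x_{p-k-1} = 0$, and so again $x_h$ is representable in $\lfloor p/2 \rfloor$ digits.

To deal with $x_l$, scale $x$ to be an integer satisfying $\beta^{p-1} \le x \le \beta^p - 1$. Let $x = \bar{x}_h + \bar{x}_l$ where $\bar{x}_h$ is the $p - k$ high order digits of $x$, and $\bar{x}_l$ is the $k$ low order digits. There are three cases to consider. If $\bar{x}_l < (\beta/2)\beta^{k-1}$, then rounding $x$ to $p - k$ places is the same as chopping and $x_h = \bar{x}_h$, and $x_l = \bar{x}_l$. Since $\bar{x}_l$ has at most $k$ digits, if $p$ is even, then $\bar{x}_l$ has at most $k = \lceil p/2 \rceil = \lfloor p/2 \rfloor$ digits. Otherwise, $\beta = 2$ and $\bar{x}_l < 2^{k-1}$ is representable with $k - 1 \le \lfloor p/2 \rfloor$ significant bits. The second case is when $\bar{x}_l > (\beta/2)\beta^{k-1}$, and then computing $x_h$ involves rounding up, so $x_h = \bar{x}_h + \beta^k$, and $x_l = x - x_h = x - \bar{x}_h - \beta^k = \bar{x}_l - \beta^k$. Once again, $\bar{x}_l$ has at most $k$ digits, so is representable with $\lfloor p/2 \rfloor$ digits. Finally, if $\bar{x}_l = (\beta/2)\beta^{k-1}$, then $x_h = \bar{x}_h$ or $\bar{x}_h + \beta^k$ depending on whether there is a round up. So $x_l$ is either $(\beta/2)\beta^{k-1}$ or $(\beta/2)\beta^{k-1} - \beta^k = -\beta^k/2$, both of which are represented with 1 digit. ∎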
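The splitting described by Theorem 14 and used in Theorem 6 can be tried in IEEE double precision, where $\beta = 2$, $p = 53$ and $k = \lceil p/2 \rceil = 27$. This Python sketch uses hypothetical helper names and checks, with exact rationals, that the two halves sum to $x$ and each fit in $\lfloor p/2 \rfloor = 26$ significant bits.

```python
import random
from fractions import Fraction

K = 27                       # k = ceil(p / 2) for p = 53
M = float(2**K + 1)          # m = beta^k + 1 with beta = 2

def split(x):
    """Return (x_high, x_low) with x_high = (m*x) - (m*x - x)."""
    t = M * x
    x_high = t - (t - x)
    x_low = x - x_high
    return x_high, x_low

def significant_bits(v):
    """Number of significant bits needed to represent the float v exactly."""
    n = abs(Fraction(v).numerator)
    while n and n % 2 == 0:
        n //= 2
    return n.bit_length()

random.seed(0)
samples = [random.uniform(1.0, 2.0) for _ in range(1000)]
splits = [(x, *split(x)) for x in samples]
```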

Theorem 6 gives a way to express the product of two working precision numbers exactly as a sum. There is a companion formula for expressing a sum exactly. If $|x| \ge |y|$ then $x + y = (x \oplus y) + (x \ominus (x \oplus y)) \oplus y$ [Dekker 1971; Knuth 1981, Theorem C in section 4.2.2]. However, when using exactly rounded operations, this formula is only true for $\beta = 2$, and not for $\beta = 10$ as the example $x = .99998$, $y = .99997$ shows.
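In binary the companion formula is straightforward to exercise. The Python sketch below assumes IEEE doubles with round to nearest; the function name `fast_two_sum` is a label chosen here, and the sample pairs are arbitrary values satisfying $|x| \ge |y|$.

```python
from fractions import Fraction

def fast_two_sum(x, y):
    """For |x| >= |y|: return (s, e) with s = fl(x + y) and s + e = x + y exactly."""
    s = x + y
    e = (x - s) + y
    return s, e

s, e = fast_two_sum(1.0, 2.0**-60)        # y is entirely lost in s, recovered in e

pairs = [(1e16, 1.0), (0.3, 0.1), (-7.5, 2.0**-40)]
results = [(x, y, *fast_two_sum(x, y)) for x, y in pairs]
```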

### Binary to Decimal Conversion

Since single precision has $p = 24$, and $2^{24} < 10^8$, you might expect that converting a binary number to 8 decimal digits would be sufficient to recover the original binary number. However, this is not the case.

#### Theorem 15

When a binary IEEE single precision number is converted to the closest eight digit decimal number, it is not always possible to uniquely recover the binary number from the decimal one. However, if nine decimal digits are used, then converting the decimal number to the closest binary number will recover the original floating-point number.

#### Proof

Binary single precision numbers lying in the half open interval $[10^3, 2^{10}) = [1000, 1024)$ have 10 bits to the left of the binary point, and 14 bits to the right of the binary point. Thus there are $(2^{10} - 10^3)2^{14} = 393{,}216$ different binary numbers in that interval. If decimal numbers are represented with 8 digits, then there are $(2^{10} - 10^3)10^4 = 240{,}000$ decimal numbers in the same interval. There is no way that 240,000 decimal numbers could represent 393,216 different binary numbers. So 8 decimal digits are not enough to uniquely represent each single precision binary number.

To show that 9 digits are sufficient, it is enough to show that the spacing between binary numbers is always greater than the spacing between decimal numbers. This will ensure that for each decimal number $N$, the interval

$$[N - \tfrac{1}{2}\,\text{ulp},\ N + \tfrac{1}{2}\,\text{ulp}]$$

contains at most one binary number. Thus each binary number rounds to a unique decimal number which in turn rounds to a unique binary number.

To show that the spacing between binary numbers is always greater than the spacing between decimal numbers, consider an interval $[10^n, 10^{n+1}]$. On this interval, the spacing between consecutive decimal numbers is $10^{(n+1)-9}$. On $[10^n, 2^m]$, where $m$ is the smallest integer so that $10^n < 2^m$, the spacing of binary numbers is $2^{m-24}$, and the spacing gets larger further on in the interval. Thus it is enough to check that $10^{(n+1)-9} < 2^{m-24}$. But in fact, since $10^n < 2^m$, then $10^{(n+1)-9} = 10^n 10^{-8} < 2^m 10^{-8} < 2^m 2^{-24}$. ∎

The same argument applied to double precision shows that 17 decimal digits are required to recover a double precision number.
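The counting argument and the nine-digit round trip can be checked in Python. In this sketch `to_single` is a hypothetical helper that rounds a double to single precision with `struct`; the decimal string is first converted to a double and then rounded to single, which is assumed here to agree with a direct conversion because a nine-digit decimal lies well away from the midpoint between two single precision numbers.

```python
import random
import struct

def to_single(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]

binary_count = (2**10 - 10**3) * 2**14      # single precision numbers in [1000, 1024)
decimal_count = (2**10 - 10**3) * 10**4     # 8-digit decimals in the same interval

# Two adjacent single precision numbers that print as the same 8-digit decimal.
first = 1000.0 + 2.0**-14
second = 1000.0 + 2.0**-13
collision = format(first, ".8g") == format(second, ".8g")

# Nine significant digits are enough to get every sampled number back.
random.seed(0)
singles = [to_single(random.uniform(1e-3, 1e6)) for _ in range(10_000)]
round_trips = all(to_single(float(format(s, ".9g"))) == s for s in singles)
```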

Binary-decimal conversion also provides another example of the use of flags. Recall from the section Precision that to recover a binary number from its decimal expansion, the decimal to binary conversion must be computed exactly. That conversion is performed by multiplying the quantities $N$ and $10^{|P|}$ (which are both exact if $P < 13$) in single-extended precision and then rounding this to single precision (or dividing if $P < 0$; both cases are similar). Of course the computation of $N \cdot 10^{|P|}$ cannot be exact; it is the combined operation round($N \cdot 10^{|P|}$) that must be exact, where the rounding is from single-extended to single precision. To see why it might fail to be exact, take the simple case of $\beta = 10$, $p = 2$ for single, and $p = 3$ for single-extended. If the product is to be 12.51, then this would be rounded to 12.5 as part of the single-extended multiply operation. Rounding to single precision would give 12. But that answer is not correct, because rounding the product to single precision should give 13. The error is due to double rounding.
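The decimal example can be reproduced with Python's standard `decimal` module, which supports arbitrary working precision. The context names below are labels chosen for this illustration, and round-half-even is assumed for both roundings.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

single = Context(prec=2, rounding=ROUND_HALF_EVEN)            # p = 2
single_extended = Context(prec=3, rounding=ROUND_HALF_EVEN)   # p = 3

product = Decimal("12.51")                    # the exact product

# Round to single-extended first, then to single: 12.51 -> 12.5 -> 12.
rounded_twice = single.plus(single_extended.plus(product))
# Round the exact product to single directly: 12.51 -> 13.
rounded_once = single.plus(product)
print(rounded_twice, rounded_once)
```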

By using the IEEE flags, double rounding can be avoided as follows. Save the current value of the inexact flag, and then reset it. Set the rounding mode to round-to-zero. Then perform the multiplication $N \cdot 10^{|P|}$. Store the new value of the inexact flag in `ixflag`, and restore the rounding mode and inexact flag. If `ixflag` is 0, then $N \cdot 10^{|P|}$ is exact, so round($N \cdot 10^{|P|}$) will be correct down to the last bit. If `ixflag` is 1, then some digits were truncated, since round-to-zero always truncates. The significand of the product will look like $1.b_1 \ldots b_{22}b_{23} \ldots b_{31}$. A double rounding error may occur if $b_{23} \ldots b_{31} = 10 \ldots 0$. A simple way to account for both cases is to perform a logical `OR` of `ixflag` with $b_{31}$. Then round($N \cdot 10^{|P|}$) will be computed correctly in all cases.

### Errors In Summation

The section Optimizers mentioned the problem of accurately computing very long sums. The simplest approach to improving accuracy is to double the precision. To get a rough estimate of how much doubling the precision improves the accuracy of a sum, let $s_1 = x_1$, $s_2 = s_1 \oplus x_2$, …, $s_i = s_{i-1} \oplus x_i$. Then $s_i = (1 + \delta_i)(s_{i-1} + x_i)$, where $|\delta_i| \le \epsilon$, and ignoring second order terms in $\delta_i$ gives

$$(31) \quad s_n = \sum_{j=1}^{n} x_j \left(1 + \sum_{k=j}^{n} \delta_k\right) = \sum_{j=1}^{n} x_j + \sum_{j=1}^{n} x_j \left(\sum_{k=j}^{n} \delta_k\right)$$

The first equality of (31) shows that the computed value of $\sum x_j$ is the same as if an exact summation was performed on perturbed values of $x_j$. The first term $x_1$ is perturbed by $n\epsilon$, the last term $x_n$ by only $\epsilon$. The second equality in (31) shows that the error term is bounded by $n\epsilon \sum |x_j|$. Doubling the precision has the effect of squaring $\epsilon$. If the sum is being done in an IEEE double precision format, $1/\epsilon \approx 10^{16}$, so that $n\epsilon \ll 1$ for any reasonable value of $n$. Thus, doubling the precision takes the maximum perturbation of $n\epsilon$ and changes it to $n\epsilon^2 \ll \epsilon$. Thus the $2\epsilon$ error bound for the Kahan summation formula (Theorem 8) is not as good as using double precision, even though it is much better than single precision.

For an intuitive explanation of why the Kahan summation formula works, consider the following diagram of the procedure.

Each time a summand is added, there is a correction factor C which will be applied on the next loop. So first subtract the correction C computed in the previous loop from Xj, giving the corrected summand Y. Then add this summand to the running sum S. The low order bits of Y (namely Yl) are lost in the sum. Next compute the high order bits of Y by computing T – S. When Y is subtracted from this, the low order bits of Y will be recovered. These are the bits that were lost in the first sum in the diagram. They become the correction factor for the next loop. A formal proof of Theorem 8, taken from Knuth [1981] page 572, appears in the section Theorem 14 and Theorem 8.
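The flow of the correction can be followed on a two-step example in IEEE double precision. In this Python sketch the summand $2^{-53}$ is chosen so that adding it to 1.0 is rounded away entirely; the `trace` list records the intermediate values `(Y, T, C)` of each loop.

```python
tiny = 2.0**-53            # half the spacing of doubles just above 1.0

# Plain summation: each addition of tiny to 1.0 rounds back to 1.0.
naive = 1.0
for _ in range(2):
    naive = naive + tiny

# Kahan summation, starting from S = 1.0, with a trace of (Y, T, C).
S, C = 1.0, 0.0
trace = []
for x in (tiny, tiny):
    Y = x - C              # corrected summand
    T = S + Y              # low order bits of Y may be lost here
    C = (T - S) - Y        # minus the lost bits, carried to the next loop
    S = T
    trace.append((Y, T, C))
```

In the first loop all of `Y` is lost and `C` becomes `-tiny`; in the second loop the corrected summand is `2 * tiny`, which is large enough to register in the sum.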

## Summary

It is not uncommon for computer system designers to neglect the parts of a system related to floating-point. This is probably due to the fact that floating-point is given very little (if any) attention in the computer science curriculum. This in turn has caused the apparently widespread belief that floating-point is not a quantifiable subject, and so there is little point in fussing over the details of hardware and software that deal with it.

This paper has demonstrated that it is possible to reason rigorously about floating-point. For example, floating-point algorithms involving cancellation can be proven to have small relative errors if the underlying hardware has a guard digit, and there is an efficient algorithm for binary-decimal conversion that can be proven to be invertible, provided that extended precision is supported. The task of constructing reliable floating-point software is made much easier when the underlying computer system is supportive of floating-point. In addition to the two examples just mentioned (guard digits and extended precision), the section Systems Aspects of this paper has examples ranging from instruction set design to compiler optimization illustrating how to better support floating-point.

The increasing acceptance of the IEEE floating-point standard means that codes that utilize features of the standard are becoming ever more portable. The section The IEEE Standard, gave numerous examples illustrating how the features of the IEEE standard can be used in writing practical floating-point codes.

## Acknowledgments

This article was inspired by a course given by W. Kahan at Sun Microsystems from May through July of 1988, which was very ably organized by David Hough of Sun. My hope is to enable others to learn about the interaction of floating-point and computer systems without having to get up in time to attend 8:00 a.m. lectures. Thanks are due to Kahan and many of my colleagues at Xerox PARC (especially John Gilbert) for reading drafts of this paper and providing many useful comments. Reviews from Paul Hilfinger and an anonymous referee also helped improve the presentation.

## References

Aho, Alfred V., Sethi, R., and Ullman J. D. 1986. Compilers: Principles, Techniques and Tools, Addison-Wesley, Reading, MA.

ANSI 1978. American National Standard Programming Language FORTRAN, ANSI Standard X3.9-1978, American National Standards Institute, New York, NY.

Barnett, David 1987. A Portable Floating-Point Environment, unpublished manuscript.

Brown, W. S. 1981. A Simple but Realistic Model of Floating-Point Computation, ACM Trans. on Math. Software 7(4), pp. 445-480.

Cody, W. J. et al. 1984. A Proposed Radix- and Word-length-independent Standard for Floating-point Arithmetic, IEEE Micro 4(4), pp. 86-100.

Cody, W. J. 1988. Floating-Point Standards — Theory and Practice, in “Reliability in Computing: the role of interval methods in scientific computing”, ed. by Ramon E. Moore, pp. 99-107, Academic Press, Boston, MA.

Coonen, Jerome 1984. Contributions to a Proposed Standard for Binary Floating-Point Arithmetic, PhD Thesis, Univ. of California, Berkeley.

Dekker, T. J. 1971. A Floating-Point Technique for Extending the Available Precision, Numer. Math. 18(3), pp. 224-242.

Demmel, James 1984. Underflow and the Reliability of Numerical Software, SIAM J. Sci. Stat. Comput. 5(4), pp. 887-919.

Farnum, Charles 1988. Compiler Support for Floating-point Computation, Software-Practice and Experience, 18(7), pp. 701-709.

Forsythe, G. E. and Moler, C. B. 1967. Computer Solution of Linear Algebraic Systems, Prentice-Hall, Englewood Cliffs, NJ.

Goldberg, I. Bennett 1967. 27 Bits Are Not Enough for 8-Digit Accuracy, Comm. of the ACM 10(2), pp. 105-106.

Goldberg, David 1990. Computer Arithmetic, in “Computer Architecture: A Quantitative Approach”, by David Patterson and John L. Hennessy, Appendix A, Morgan Kaufmann, Los Altos, CA.

Golub, Gene H. and Van Loan, Charles F. 1989. Matrix Computations, 2nd edition, The Johns Hopkins University Press, Baltimore, MD.

Graham, Ronald L., Knuth, Donald E. and Patashnik, Oren 1989. Concrete Mathematics, Addison-Wesley, Reading, MA, p. 162.

Hewlett Packard 1982. HP-15C Advanced Functions Handbook.

IEEE 1987. IEEE Standard 754-1985 for Binary Floating-point Arithmetic, IEEE, (1985). Reprinted in SIGPLAN 22(2) pp. 9-25.

Kahan, W. 1972. A Survey Of Error Analysis, in Information Processing 71, Vol 2, pp. 1214-1239 (Ljubljana, Yugoslavia), North Holland, Amsterdam.

Kahan, W. 1986. Calculating Area and Angle of a Needle-like Triangle, unpublished manuscript.

Kahan, W. 1987. Branch Cuts for Complex Elementary Functions, in “The State of the Art in Numerical Analysis”, ed. by M.J.D. Powell and A. Iserles (Univ of Birmingham, England), Chapter 7, Oxford University Press, New York.

Kahan, W. 1988. Unpublished lectures given at Sun Microsystems, Mountain View, CA.

Kahan, W. and Coonen, Jerome T. 1982. The Near Orthogonality of Syntax, Semantics, and Diagnostics in Numerical Programming Environments, in “The Relationship Between Numerical Computation And Programming Languages”, ed. by J. K. Reid, pp. 103-115, North-Holland, Amsterdam.

Kahan, W. and LeBlanc, E. 1985. Anomalies in the IBM Acrith Package, Proc. 7th IEEE Symposium on Computer Arithmetic (Urbana, Illinois), pp. 322-331.

Kernighan, Brian W. and Ritchie, Dennis M. 1978. The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ.

Kirchner, R. and Kulisch, U. 1987. Arithmetic for Vector Processors, Proc. 8th IEEE Symposium on Computer Arithmetic (Como, Italy), pp. 256-269.

Knuth, Donald E. 1981. The Art of Computer Programming, Volume II, Second Edition, Addison-Wesley, Reading, MA.

Kulisch, U. W. and Miranker, W. L. 1986. The Arithmetic of the Digital Computer: A New Approach, SIAM Review 28(1), pp. 1-36.

Matula, D. W. and Kornerup, P. 1985. Finite Precision Rational Arithmetic: Slash Number Systems, IEEE Trans. on Comput. C-34(1), pp. 3-18.

Nelson, G. 1991. Systems Programming With Modula-3, Prentice-Hall, Englewood Cliffs, NJ.

Reiser, John F. and Knuth, Donald E. 1975. Evading the Drift in Floating-point Addition, Information Processing Letters 3(3), pp. 84-87.

Sterbenz, Pat H. 1974. Floating-Point Computation, Prentice-Hall, Englewood Cliffs, NJ.

Swartzlander, Earl E. and Alexopoulos, Aristides G. 1975. The Sign/Logarithm Number System, IEEE Trans. Comput. C-24(12), pp. 1238-1242.

Walther, J. S. 1971. A Unified Algorithm for Elementary Functions, Proceedings of the AFIPS Spring Joint Computer Conf. 38, pp. 379-385.

## Theorem 14 and Theorem 8

This section contains two of the more technical proofs that were omitted from the text.

### Theorem 14

Let $0 < k < p$, set $m = \beta^k + 1$, and assume that floating-point operations are exactly rounded. Then $(m \otimes x) \ominus (m \otimes x \ominus x)$ is exactly equal to $x$ rounded to $p - k$ significant digits. More precisely, $x$ is rounded by taking the significand of $x$, imagining a radix point just left of the $k$ least significant digits, and rounding to an integer.
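For binary double precision ($\beta = 2$, $p = 53$) the statement can be checked numerically. The following Python sketch assumes that Python floats are correctly rounded IEEE 754 doubles; the helper names `round_via_theorem14` and `is_nearest_with_fewer_bits` are illustrative, and the check accepts either neighbor in a halfway case rather than asserting a particular tie-breaking rule.

```python
import math
from fractions import Fraction

P = 53  # significand bits of an IEEE 754 double


def round_via_theorem14(x: float, k: int) -> float:
    """Compute (m (*) x) (-) (m (*) x (-) x) with m = 2**k + 1."""
    m = float(2**k + 1)  # exactly representable for k <= 52
    mx = m * x           # m (*) x, rounded to P bits
    return mx - (mx - x)


def is_nearest_with_fewer_bits(x: float, xh: float, k: int) -> bool:
    """True if xh is a nearest neighbor of x among numbers with P - k bits."""
    _, exp = math.frexp(x)                    # |x| = mant * 2**exp, 0.5 <= mant < 1
    quantum = Fraction(2) ** (exp - (P - k))  # weight of the last retained bit
    on_grid = (Fraction(xh) / quantum).denominator == 1
    close_enough = abs(Fraction(x) - Fraction(xh)) <= quantum / 2
    return on_grid and close_enough


x = 1.0 / 3.0
for k in (2, 10, 27, 40):
    xh = round_via_theorem14(x, k)
    print(k, xh.hex(), is_nearest_with_fewer_bits(x, xh, k))
```

The hexadecimal output makes the shortened significand visible: the trailing hex digits of `xh` become zero as `k` grows.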

### Proof

The proof breaks up into two cases, depending on whether or not the computation of $mx = \beta^k x + x$ has a carry-out.

Assume there is no carry-out. It is harmless to scale $x$ so that it is an integer. Then the computation of $mx = x + \beta^k x$ looks like this:

```
  aa...aabb...bb
+ aa...aabb...bb
  zz...zzbb...bb
```

where $x$ has been partitioned into two parts. The low order $k$ digits are marked `b` and the high order $p - k$ digits are marked `a`. To compute $m \otimes x$ from $mx$ involves rounding off the low order $k$ digits (the ones marked with `b`) so

$$m \otimes x = mx - (x \bmod \beta^k) + r\beta^k \tag{32}$$

The value of $r$ is 1 if `.bb...b` is greater than 1/2 and 0 otherwise. More precisely

$$r = 1 \text{ if } \texttt{a.bb...b} \text{ rounds to } a + 1, \quad r = 0 \text{ otherwise.} \tag{33}$$

Next compute $m \otimes x - x = mx - (x \bmod \beta^k) + r\beta^k - x = \beta^k(x + r) - (x \bmod \beta^k)$. The picture below shows the computation of $m \otimes x - x$ rounded, that is, $(m \otimes x) \ominus x$. The top line is $\beta^k(x + r)$, where `B` is the digit that results from adding `r` to the lowest order digit `b`.

```
  aa...aabb...bB00...00
-               bb...bb
        zz...zzZ00...00
```

If `.bb...b` < 1/2 then $r = 0$, subtracting causes a borrow from the digit marked `B`, but the difference is rounded up, and so the net effect is that the rounded difference equals the top line, which is $\beta^k x$. If `.bb...b` > 1/2 then $r = 1$, and 1 is subtracted from `B` because of the borrow, so the result is $\beta^k x$. Finally consider the case `.bb...b` = 1/2. If $r = 0$ then `B` is even, `Z` is odd, and the difference is rounded up, giving $\beta^k x$. Similarly when $r = 1$, `B` is odd, `Z` is even, the difference is rounded down, so again the difference is $\beta^k x$. To summarize

$$(m \otimes x) \ominus x = \beta^k x \tag{34}$$

Combining equations (32) and (34) gives $(m \otimes x) - (m \otimes x \ominus x) = x - (x \bmod \beta^k) + r\beta^k$. The result of performing this computation is

```
        r00...00
+ aa...aabb...bb
-        bb...bb
  aa...aA00...00
```

The rule for computing $r$, equation (33), is the same as the rule for rounding `a...ab...b` to $p - k$ places. Thus computing $mx - (mx - x)$ in floating-point arithmetic precision is exactly equal to rounding $x$ to $p - k$ places, in the case when $x + \beta^k x$ does not carry out.

When $x + \beta^k x$ does carry out, then $mx = \beta^k x + x$ looks like this:

```
  aa...aabb...bb
+ aa...aabb...bb
  zz...zZbb...bb
```

Thus, $m \otimes x = mx - (x \bmod \beta^k) + w\beta^k$, where $w = -Z$ if $Z < \beta/2$, but the exact value of $w$ is unimportant. Next, $m \otimes x - x = \beta^k x - (x \bmod \beta^k) + w\beta^k$. In a picture³¹

```
  aa...aabb...bb00...00
-               bb...bb
+ w
  zz...zZbb...bb
```

Rounding gives $(m \otimes x) \ominus x = \beta^k x + w\beta^k - r\beta^k$, where $r = 1$ if `.bb...b` > 1/2 or if `.bb...b` = 1/2 and $b_0 = 1$.³² Finally,

$$\begin{aligned}
(m \otimes x) - (m \otimes x \ominus x) &= mx - (x \bmod \beta^k) + w\beta^k - (\beta^k x + w\beta^k - r\beta^k) \\
&= x - (x \bmod \beta^k) + r\beta^k.
\end{aligned}$$

And once again, $r = 1$ exactly when rounding `a...ab...b` to $p - k$ places involves rounding up. Thus Theorem 14 is proven in all cases. ∎

#### Theorem 8 (Kahan Summation Formula)

Suppose that $\sum_{j=1}^{N} x_j$ is computed using the following algorithm

```
S = X[1];
C = 0;
for j = 2 to N {
    Y = X[j] - C;
    T = S + Y;
    C = (T - S) - Y;
    S = T;
}
```

Then the computed sum $S$ is equal to $\sum x_j (1 + \delta_j) + O(N\epsilon^2) \sum |x_j|$, where $|\delta_j| \le 2\epsilon$.

#### Proof

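First recall how the error estimate for the simple formula $\sum x_i$ went. Introduce $s_1 = x_1$, $s_i = (1 + \delta_i)(s_{i-1} + x_i)$. Then the computed sum is $s_n$, which is a sum of terms, each of which is an $x_i$ multiplied by an expression involving $\delta_j$'s. The exact coefficient of $x_1$ is $(1 + \delta_2)(1 + \delta_3) \cdots (1 + \delta_n)$, and so by renumbering, the coefficient of $x_2$ must be $(1 + \delta_3)(1 + \delta_4) \cdots (1 + \delta_n)$, and so on. The proof of Theorem 8 runs along exactly the same lines, only the coefficient of $x_1$ is more complicated. In detail $s_0 = c_0 = 0$ and

$$\begin{aligned}
y_k &= x_k \ominus c_{k-1} = (x_k - c_{k-1})(1 + \eta_k) \\
s_k &= s_{k-1} \oplus y_k = (s_{k-1} + y_k)(1 + \sigma_k) \\
c_k &= (s_k \ominus s_{k-1}) \ominus y_k = [(s_k - s_{k-1})(1 + \gamma_k) - y_k](1 + \delta_k)
\end{aligned}$$

where all the Greek letters are bounded by $\epsilon$. Although the coefficient of $x_1$ in $s_k$ is the ultimate expression of interest, it turns out to be easier to compute the coefficient of $x_1$ in $s_k - c_k$ and $c_k$.

When $k = 1$,

$$\begin{aligned}
c_1 &= (s_1(1 + \gamma_1) - y_1)(1 + \delta_1) \\
&= y_1((1 + \sigma_1)(1 + \gamma_1) - 1)(1 + \delta_1) \\
&= x_1(\sigma_1 + \gamma_1 + \sigma_1\gamma_1)(1 + \delta_1)(1 + \eta_1) \\
s_1 - c_1 &= x_1[(1 + \sigma_1) - (\sigma_1 + \gamma_1 + \sigma_1\gamma_1)(1 + \delta_1)](1 + \eta_1) \\
&= x_1[1 - \gamma_1 - \sigma_1\delta_1 - \sigma_1\gamma_1 - \delta_1\gamma_1 - \sigma_1\gamma_1\delta_1](1 + \eta_1)
\end{aligned}$$

Calling the coefficients of $x_1$ in these expressions $C_k$ and $S_k$ respectively, then

$$\begin{aligned}
C_1 &= 2\epsilon + O(\epsilon^2) \\
S_1 &= 1 + \eta_1 - \gamma_1 + 4\epsilon^2 + O(\epsilon^3)
\end{aligned}$$

To get the general formula for $S_k$ and $C_k$, expand the definitions of $s_k$ and $c_k$, ignoring all terms involving $x_i$ with $i > 1$ to get

$$\begin{aligned}
s_k &= (s_{k-1} + y_k)(1 + \sigma_k) \\
&= [s_{k-1} + (x_k - c_{k-1})(1 + \eta_k)](1 + \sigma_k) \\
&= [(s_{k-1} - c_{k-1}) - \eta_k c_{k-1}](1 + \sigma_k) \\
c_k &= [\{s_k - s_{k-1}\}(1 + \gamma_k) - y_k](1 + \delta_k) \\
&= [\{((s_{k-1} - c_{k-1}) - \eta_k c_{k-1})(1 + \sigma_k) - s_{k-1}\}(1 + \gamma_k) + c_{k-1}(1 + \eta_k)](1 + \delta_k) \\
&= [\{(s_{k-1} - c_{k-1})\sigma_k - \eta_k c_{k-1}(1 + \sigma_k) - c_{k-1}\}(1 + \gamma_k) + c_{k-1}(1 + \eta_k)](1 + \delta_k) \\
&= [(s_{k-1} - c_{k-1})\sigma_k(1 + \gamma_k) - c_{k-1}(\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))](1 + \delta_k), \\
s_k - c_k &= ((s_{k-1} - c_{k-1}) - \eta_k c_{k-1})(1 + \sigma_k) \\
&\quad - [(s_{k-1} - c_{k-1})\sigma_k(1 + \gamma_k) - c_{k-1}(\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))](1 + \delta_k) \\
&= (s_{k-1} - c_{k-1})((1 + \sigma_k) - \sigma_k(1 + \gamma_k)(1 + \delta_k)) \\
&\quad + c_{k-1}(-\eta_k(1 + \sigma_k) + (\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))(1 + \delta_k)) \\
&= (s_{k-1} - c_{k-1})(1 - \sigma_k(\gamma_k + \delta_k + \gamma_k\delta_k)) \\
&\quad + c_{k-1}[-\eta_k + \gamma_k + \eta_k(\gamma_k + \sigma_k\gamma_k) + (\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))\delta_k]
\end{aligned}$$

Since $S_k$ and $C_k$ are only being computed up to order $\epsilon^2$, these formulas can be simplified to

$$\begin{aligned}
C_k &= (\sigma_k + O(\epsilon^2))S_{k-1} + (-\gamma_k + O(\epsilon^2))C_{k-1} \\
S_k &= (1 + 2\epsilon^2 + O(\epsilon^3))S_{k-1} + (2\epsilon + O(\epsilon^2))C_{k-1}
\end{aligned}$$

Using these formulas gives

$$\begin{aligned}
C_2 &= \sigma_2 + O(\epsilon^2) \\
S_2 &= 1 + \eta_1 - \gamma_1 + 10\epsilon^2 + O(\epsilon^3)
\end{aligned}$$

and in general it is easy to check by induction that

$$\begin{aligned}
C_k &= \sigma_k + O(\epsilon^2) \\
S_k &= 1 + \eta_1 - \gamma_1 + (4k + 2)\epsilon^2 + O(\epsilon^3)
\end{aligned}$$

Finally, what is wanted is the coefficient of $x_1$ in $s_k$. To get this value, let $x_{n+1} = 0$, let all the Greek letters with subscripts of $n + 1$ equal 0, and compute $s_{n+1}$. Then $s_{n+1} = s_n - c_n$, and the coefficient of $x_1$ in $s_n$ is less than the coefficient in $s_{n+1}$, which is $S_n = 1 + \eta_1 - \gamma_1 + (4n + 2)\epsilon^2 = (1 + 2\epsilon + O(n\epsilon^2))$. ∎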

## Differences Among IEEE 754 Implementations

Note – This section is not part of the published paper. It has been added to clarify certain points and correct possible misconceptions about the IEEE standard that the reader might infer from the paper. This material was not written by David Goldberg, but it appears here with his permission.

The preceding paper has shown that floating-point arithmetic must be implemented carefully, since programmers may depend on its properties for the correctness and accuracy of their programs. In particular, the IEEE standard requires a careful implementation, and it is possible to write useful programs that work correctly and deliver accurate results only on systems that conform to the standard. The reader might be tempted to conclude that such programs should be portable to all IEEE systems. Indeed, portable software would be easier to write if the remark “When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic,” were true.

Unfortunately, the IEEE standard does not guarantee that the same program will deliver identical results on all conforming systems. Most programs will actually produce different results on different systems for a variety of reasons. For one, most programs involve the conversion of numbers between decimal and binary formats, and the IEEE standard does not completely specify the accuracy with which such conversions must be performed. For another, many programs use elementary functions supplied by a system library, and the standard doesn’t specify these functions at all. Of course, most programmers know that these features lie beyond the scope of the IEEE standard.

Many programmers may not realize that even a program that uses only the numeric formats and operations prescribed by the IEEE standard can compute different results on different systems. In fact, the authors of the standard intended to allow different implementations to obtain different results. Their intent is evident in the definition of the term destination in the IEEE 754 standard: “A destination may be either explicitly designated by the user or implicitly supplied by the system (for example, intermediate results in subexpressions or arguments for procedures). Some languages place the results of intermediate calculations in destinations beyond the user’s control. Nonetheless, this standard defines the result of an operation in terms of that destination’s format and the operands’ values.” (IEEE 754-1985, p. 7) In other words, the IEEE standard requires that each result be rounded correctly to the precision of the destination into which it will be placed, but the standard does not require that the precision of that destination be determined by a user’s program. Thus, different systems may deliver their results to destinations with different precisions, causing the same program to produce different results (sometimes dramatically so), even though those systems all conform to the standard.

Several of the examples in the preceding paper depend on some knowledge of the way floating-point arithmetic is rounded. In order to rely on examples such as these, a programmer must be able to predict how a program will be interpreted, and in particular, on an IEEE system, what the precision of the destination of each arithmetic operation may be. Alas, the loophole in the IEEE standard’s definition of destination undermines the programmer’s ability to know how a program will be interpreted. Consequently, several of the examples given above, when implemented as apparently portable programs in a high-level language, may not work correctly on IEEE systems that normally deliver results to destinations with a different precision than the programmer expects. Other examples may work, but proving that they work may lie beyond the average programmer’s ability.

In this section, we classify existing implementations of IEEE 754 arithmetic based on the precisions of the destination formats they normally use. We then review some examples from the paper to show that delivering results in a wider precision than a program expects can cause it to compute wrong results even though it is provably correct when the expected precision is used. We also revisit one of the proofs in the paper to illustrate the intellectual effort required to cope with unexpected precision even when it doesn’t invalidate our programs. These examples show that despite all that the IEEE standard prescribes, the differences it allows among different implementations can prevent us from writing portable, efficient numerical software whose behavior we can accurately predict. To develop such software, then, we must first create programming languages and environments that limit the variability the IEEE standard permits and allow programmers to express the floating-point semantics upon which their programs depend.

### Current IEEE 754 Implementations

Current implementations of IEEE 754 arithmetic can be divided into two groups distinguished by the degree to which they support different floating-point formats in hardware. Extended-based systems, exemplified by the Intel x86 family of processors, provide full support for an extended double precision format but only partial support for single and double precision: they provide instructions to load or store data in single and double precision, converting it on-the-fly to or from the extended double format, and they provide special modes (not the default) in which the results of arithmetic operations are rounded to single or double precision even though they are kept in registers in extended double format. (Motorola 68000 series processors round results to both the precision and range of the single or double formats in these modes. Intel x86 and compatible processors round results to the precision of the single or double formats but retain the same range as the extended double format.) Single/double systems, including most RISC processors, provide full support for single and double precision formats but no support for an IEEE-compliant extended double precision format. (The IBM POWER architecture provides only partial support for single precision, but for the purpose of this section, we classify it as a single/double system.)

To see how a computation might behave differently on an extended-based system than on a single/double system, consider a C version of the example from the section Systems Aspects:

```
#include <stdio.h>

int main() {
    double q;

    q = 3.0/7.0;
    if (q == 3.0/7.0) printf("Equal\n");
    else printf("Not Equal\n");
    return 0;
}
```

Here the constants 3.0 and 7.0 are interpreted as double precision floating-point numbers, and the expression 3.0/7.0 inherits the `double` data type. On a single/double system, the expression will be evaluated in double precision since that is the most efficient format to use. Thus, `q` will be assigned the value 3.0/7.0 rounded correctly to double precision. In the next line, the expression 3.0/7.0 will again be evaluated in double precision, and of course the result will be equal to the value just assigned to `q`, so the program will print “Equal” as expected.

On an extended-based system, even though the expression 3.0/7.0 has type `double`, the quotient will be computed in a register in extended double format, and thus in the default mode, it will be rounded to extended double precision. When the resulting value is assigned to the variable `q`, however, it may then be stored in memory, and since `q` is declared `double`, the value will be rounded to double precision. In the next line, the expression 3.0/7.0 may again be evaluated in extended precision yielding a result that differs from the double precision value stored in `q`, causing the program to print “Not Equal”. Of course, other outcomes are possible, too: the compiler could decide to store and thus round the value of the expression 3.0/7.0 in the second line before comparing it with `q`, or it could keep `q` in a register in extended precision without storing it. An optimizing compiler might evaluate the expression 3.0/7.0 at compile time, perhaps in double precision or perhaps in extended double precision. (With one x86 compiler, the program prints “Equal” when compiled with optimization and “Not Equal” when compiled for debugging.) Finally, some compilers for extended-based systems automatically change the rounding precision mode to cause operations producing results in registers to round those results to single or double precision, albeit possibly with a wider range. Thus, on these systems, we can’t predict the behavior of the program simply by reading its source code and applying a basic understanding of IEEE 754 arithmetic. Neither can we accuse the hardware or the compiler of failing to provide an IEEE 754 compliant environment; the hardware has delivered a correctly rounded result to each destination, as it is required to do, and the compiler has assigned some intermediate results to destinations that are beyond the user’s control, as it is allowed to do.
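The effect of delivering the same quotient to two destinations of different precision can be imitated with exact rational arithmetic. The Python sketch below is illustrative only: `round_to_precision` is a hypothetical helper that rounds a rational to `p` significant bits with ties to even and ignores exponent range, and the 64-bit case stands in for the x86 extended double format.

```python
from fractions import Fraction


def round_to_precision(value: Fraction, p: int) -> Fraction:
    """Round value to p significant bits, ties to even (unbounded exponent)."""
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    v = abs(value)
    # Estimate e so that v / 2**e lies in [2**(p-1), 2**p), then correct it.
    e = v.numerator.bit_length() - v.denominator.bit_length() - p
    scaled = v / Fraction(2) ** e
    while scaled >= 2 ** p:
        scaled /= 2
        e += 1
    while scaled < 2 ** (p - 1):
        scaled *= 2
        e -= 1
    # round() on a Fraction rounds halfway cases to the even integer.
    return sign * round(scaled) * Fraction(2) ** e


exact = Fraction(3, 7)
as_double = round_to_precision(exact, 53)    # what a store to `double q` keeps
as_extended = round_to_precision(exact, 64)  # what an extended register keeps

print(as_double == Fraction(3.0 / 7.0))  # compare with Python's own double division
print(as_double == as_extended)          # compare the two destinations
```

Because 3/7 has a non-terminating binary expansion, the 53-bit and 64-bit roundings cannot coincide, which is the mismatch the `q == 3.0/7.0` comparison can stumble over.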

### Pitfalls in Computations on Extended-Based Systems

Conventional wisdom maintains that extended-based systems must produce results that are at least as accurate, if not more accurate than those delivered on single/double systems, since the former always provide at least as much precision and often more than the latter. Trivial examples such as the C program above as well as more subtle programs based on the examples discussed below show that this wisdom is naive at best: some apparently portable programs, which are indeed portable across single/double systems, deliver incorrect results on extended-based systems precisely because the compiler and hardware conspire to occasionally provide more precision than the program expects.

Current programming languages make it difficult for a program to specify the precision it expects. As the section Languages and Compilers mentions, many programming languages don’t specify that each occurrence of an expression like `10.0*x` in the same context should evaluate to the same value. Some languages, such as Ada, were influenced in this respect by variations among different arithmetics prior to the IEEE standard. More recently, languages like ANSI C have been influenced by standard-conforming extended-based systems. In fact, the ANSI C standard explicitly allows a compiler to evaluate a floating-point expression to a precision wider than that normally associated with its type. As a result, the value of the expression `10.0*x` may vary in ways that depend on a variety of factors: whether the expression is immediately assigned to a variable or appears as a subexpression in a larger expression; whether the expression participates in a comparison; whether the expression is passed as an argument to a function, and if so, whether the argument is passed by value or by reference; the current precision mode; the level of optimization at which the program was compiled; the precision mode and expression evaluation method used by the compiler when the program was compiled; and so on.

Language standards are not entirely to blame for the vagaries of expression evaluation. Extended-based systems run most efficiently when expressions are evaluated in extended precision registers whenever possible, yet values that must be stored are stored in the narrowest precision required. Constraining a language to require that `10.0*x` evaluate to the same value everywhere would impose a performance penalty on those systems. Unfortunately, allowing those systems to evaluate `10.0*x` differently in syntactically equivalent contexts imposes a penalty of its own on programmers of accurate numerical software by preventing them from relying on the syntax of their programs to express their intended semantics.

Do real programs depend on the assumption that a given expression always evaluates to the same value? Recall the algorithm presented in Theorem 4 for computing ln(1 + x), written here in Fortran:

```
real function log1p(x)
real x
if (1.0 + x .eq. 1.0) then
   log1p = x
else
   log1p = log(1.0 + x) * x / ((1.0 + x) - 1.0)
endif
return
end
```

On an extended-based system, a compiler may evaluate the expression `1.0 + x` in the third line in extended precision and compare the result with `1.0`. When the same expression is passed to the log function in the sixth line, however, the compiler may store its value in memory, rounding it to single precision. Thus, if `x` is not so small that `1.0 + x` rounds to `1.0` in extended precision but small enough that `1.0 + x` rounds to `1.0` in single precision, then the value returned by `log1p(x)` will be zero instead of `x`, and the relative error will be one, rather larger than $5\epsilon$. Similarly, suppose the rest of the expression in the sixth line, including the reoccurrence of the subexpression `1.0 + x`, is evaluated in extended precision. In that case, if `x` is small but not quite small enough that `1.0 + x` rounds to `1.0` in single precision, then the value returned by `log1p(x)` can exceed the correct value by nearly as much as `x`, and again the relative error can approach one. For a concrete example, take `x` to be $2^{-24} + 2^{-47}$, so `x` is the smallest single precision number such that `1.0 + x` rounds up to the next larger number, $1 + 2^{-23}$. Then `log(1.0 + x)` is approximately $2^{-23}$. Because the denominator in the expression in the sixth line is evaluated in extended precision, it is computed exactly and delivers `x`, so `log1p(x)` returns approximately $2^{-23}$, which is nearly twice as large as the exact value. (This actually happens with at least one compiler. When the preceding code is compiled by the Sun WorkShop Compilers 4.2.1 Fortran 77 compiler for x86 systems using the `-O` optimization flag, the generated code computes `1.0 + x` exactly as described. As a result, the function delivers zero for `log1p(1.0e-10)` and `1.19209E-07` for `log1p(5.97e-8)`.)
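The failure mode can be imitated in Python by letting double precision play the role of the wider format and rounding to single precision explicitly. This is only a sketch under that assumption: `to_single` and both `log1p_*` functions are illustrative names, the final result is not rounded back to single precision, and real compilers may mix precisions differently.

```python
import math
import struct


def to_single(v: float) -> float:
    """Round a double to the nearest IEEE 754 single precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def log1p_mixed(x: float) -> float:
    """Theorem 4's formula with 1.0 + x rounded inconsistently."""
    if 1.0 + x == 1.0:            # comparison done in the wide format
        return x
    narrow = to_single(1.0 + x)   # argument of log stored as single
    wide = 1.0 + x                # denominator kept in the wide format
    return math.log(narrow) * x / (wide - 1.0)


def log1p_consistent(x: float) -> float:
    """Same formula with 1.0 + x rounded to single everywhere."""
    w = to_single(1.0 + x)
    if w == 1.0:
        return x
    return math.log(w) * x / (w - 1.0)


x = to_single(2.0**-24 + 2.0**-47)
print(log1p_mixed(1.0e-10), log1p_consistent(1.0e-10))
print(log1p_mixed(x) / math.log1p(x), log1p_consistent(x) / math.log1p(x))
```

The second line of output gives the ratio of each variant to the library's `math.log1p`, so the factor-of-two overshoot of the inconsistent version can be read off directly.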

For the algorithm of Theorem 4 to work correctly, the expression `1.0 + x` must be evaluated the same way each time it appears; the algorithm can fail on extended-based systems only when `1.0 + x` is evaluated to extended double precision in one instance and to single or double precision in another. Of course, since `log` is a generic intrinsic function in Fortran, a compiler could evaluate the expression `1.0 + x` in extended precision throughout, computing its logarithm in the same precision, but evidently we cannot assume that the compiler will do so. (One can also imagine a similar example involving a user-defined function. In that case, a compiler could still keep the argument in extended precision even though the function returns a single precision result, but few if any existing Fortran compilers do this, either.) We might therefore attempt to ensure that `1.0 + x` is evaluated consistently by assigning it to a variable. Unfortunately, if we declare that variable `real`, we may still be foiled by a compiler that substitutes a value kept in a register in extended precision for one appearance of the variable and a value stored in memory in single precision for another. Instead, we would need to declare the variable with a type that corresponds to the extended precision format. Standard FORTRAN 77 does not provide a way to do this, and while Fortran 95 offers the `SELECTED_REAL_KIND` mechanism for describing various formats, it does not explicitly require implementations that evaluate expressions in extended precision to allow variables to be declared with that precision. In short, there is no portable way to write this program in standard Fortran that is guaranteed to prevent the expression `1.0 + x` from being evaluated in a way that invalidates our proof.

There are other examples that can malfunction on extended-based systems even when each subexpression is stored and thus rounded to the same precision. The cause is double-rounding. In the default precision mode, an extended-based system will initially round each result to extended double precision. If that result is then stored to double precision, it is rounded again. The combination of these two roundings can yield a value that is different than what would have been obtained by rounding the first result correctly to double precision. This can happen when the result as rounded to extended double precision is a “halfway case”, i.e., it lies exactly halfway between two double precision numbers, so the second rounding is determined by the round-ties-to-even rule. If this second rounding rounds in the same direction as the first, the net rounding error will exceed half a unit in the last place. (Note, though, that double-rounding only affects double precision computations. One can prove that the sum, difference, product, or quotient of two $p$-bit numbers, or the square root of a $p$-bit number, rounded first to $q$ bits and then to $p$ bits gives the same value as if the result were rounded just once to $p$ bits provided $q \ge 2p + 2$. Thus, extended double precision is wide enough that single precision computations don’t suffer double-rounding.)

Some algorithms that depend on correct rounding can fail with double-rounding. In fact, even some algorithms that don’t require correct rounding and work correctly on a variety of machines that don’t conform to IEEE 754 can fail with double-rounding. The most useful of these are the portable algorithms for performing simulated multiple precision arithmetic mentioned in the section Exactly Rounded Operations. For example, the procedure described in Theorem 6 for splitting a floating-point number into high and low parts doesn’t work correctly in double-rounding arithmetic: try to split the double precision number $2^{52} + 3 \times 2^{26} - 1$ into two parts each with at most 26 bits. When each operation is rounded correctly to double precision, the high order part is $2^{52} + 2^{27}$ and the low order part is $2^{26} - 1$, but when each operation is rounded first to extended double precision and then to double precision, the procedure produces a high order part of $2^{52} + 2^{28}$ and a low order part of $-2^{26} - 1$. The latter number occupies 27 bits, so its square can’t be computed exactly in double precision. Of course, it would still be possible to compute the square of this number in extended double precision, but the resulting algorithm would no longer be portable to single/double systems. Also, later steps in the multiple precision multiplication algorithm assume that all partial products have been computed in double precision. Handling a mixture of double and extended double variables correctly would make the implementation significantly more expensive.
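The two outcomes quoted above can be reproduced with exact rational arithmetic. The following Python sketch is an illustration, not the paper's code: it assumes the splitting procedure has the form $x_h = (m \otimes x) \ominus (m \otimes x \ominus x)$, $x_l = x \ominus x_h$ with $m = 2^{27} + 1$, and the helpers `round_to_precision`, `fl`, and `split` are hypothetical names. Passing `(53,)` models correct rounding to double; `(64, 53)` models rounding to extended double and then to double.

```python
from fractions import Fraction


def round_to_precision(value: Fraction, p: int) -> Fraction:
    """Round value to p significant bits, ties to even (unbounded exponent)."""
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    v = abs(value)
    e = v.numerator.bit_length() - v.denominator.bit_length() - p
    scaled = v / Fraction(2) ** e
    while scaled >= 2 ** p:
        scaled /= 2
        e += 1
    while scaled < 2 ** (p - 1):
        scaled *= 2
        e -= 1
    return sign * round(scaled) * Fraction(2) ** e  # round() is ties-to-even


def fl(value: Fraction, precisions) -> Fraction:
    """Apply one rounding per entry of `precisions`, in order."""
    for p in precisions:
        value = round_to_precision(value, p)
    return value


def split(x: Fraction, k: int, precisions):
    """Split x into (hi, lo), rounding every operation as `precisions` says."""
    m = 2 ** k + 1
    mx = fl(m * x, precisions)
    hi = fl(mx - fl(mx - x, precisions), precisions)
    lo = fl(x - hi, precisions)
    return hi, lo


x = Fraction(2**52 + 3 * 2**26 - 1)
print(split(x, 27, (53,)))     # each operation rounded once, to double
print(split(x, 27, (64, 53)))  # each operation rounded to extended, then double
```

In both cases `hi + lo` still equals `x`; what changes is that the double-rounded low part no longer fits in 26 bits.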

Likewise, portable algorithms for adding multiple precision numbers represented as arrays of double precision numbers can fail in double-rounding arithmetic. These algorithms typically rely on a technique similar to Kahan’s summation formula. As the informal explanation of the summation formula given in the section Errors In Summation suggests, if `s` and `y` are floating-point variables with $|s| \ge |y|$ and we compute:

```
t = s + y;
e = (s - t) + y;
```

then in most arithmetics, `e` recovers exactly the roundoff error that occurred in computing `t`. This technique doesn’t work in double-rounded arithmetic, however: if `s` = $2^{52} + 1$ and `y` = $1/2 - 2^{-54}$, then `s + y` rounds first to $2^{52} + 3/2$ in extended double precision, and this value rounds to $2^{52} + 2$ in double precision by the round-ties-to-even rule; thus the net rounding error in computing `t` is $1/2 + 2^{-54}$, which is not representable exactly in double precision and so can’t be computed exactly by the expression shown above. Here again, it would be possible to recover the roundoff error by computing the sum in extended double precision, but then a program would have to do extra work to reduce the final outputs back to double precision, and double-rounding could afflict this process, too. For this reason, although portable programs for simulating multiple precision arithmetic by these methods work correctly and efficiently on a wide variety of machines, they do not work as advertised on extended-based systems.
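In correctly rounded double precision the recovery step is easy to check. The Python sketch below assumes Python floats are IEEE 754 doubles with each operation rounded once (as on typical x86-64 and ARM builds); `fast_two_sum` is an illustrative name. It also confirms that the error quoted above for the double-rounded case, $1/2 + 2^{-54}$, has no exact double representation.

```python
from fractions import Fraction


def fast_two_sum(s: float, y: float):
    """Return (t, e) with t = fl(s + y) and e = (s - t) + y; needs |s| >= |y|."""
    t = s + y
    e = (s - t) + y
    return t, e


s = float(2**52 + 1)
y = 0.5 - 2.0**-54
t, e = fast_two_sum(s, y)

# With a single rounding per operation, t + e reproduces s + y exactly.
print(Fraction(t) + Fraction(e) == Fraction(s) + Fraction(y))

# The double-rounded error magnitude needs 54 significant bits.
err = Fraction(1, 2) + Fraction(1, 2**54)
print(Fraction(float(err)) == err)
```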

Finally, some algorithms that at first sight appear to depend on correct rounding may in fact work correctly with double-rounding. In these cases, the cost of coping with double-rounding lies not in the implementation but in the verification that the algorithm works as advertised. To illustrate, we prove the following variant of Theorem 7:

#### Theorem 7′

If $m$ and $n$ are integers representable in IEEE 754 double precision with $|m| < 2^{52}$ and $n$ has the special form $n = 2^i + 2^j$, then $(m \oslash n) \otimes n = m$, provided both floating-point operations are either rounded correctly to double precision or rounded first to extended double precision and then to double precision.
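The correctly rounded half of the claim is easy to spot-check. The Python sketch below assumes Python floats are IEEE 754 doubles with correctly rounded division and multiplication; `survives_round_trip` is an illustrative name, and the exponent range is an arbitrary choice that keeps $n$ exactly representable.

```python
import random


def survives_round_trip(m: int, i: int, j: int) -> bool:
    """Check (m (/) n) (*) n == m for n = 2**i + 2**j in double precision."""
    n = 2.0**i + 2.0**j  # exact as long as |i - j| <= 52
    return (float(m) / n) * n == float(m)


random.seed(1)
trials = [
    (random.randrange(-(2**52) + 1, 2**52),
     random.randint(-20, 20),
     random.randint(-20, 20))
    for _ in range(10_000)
]
print(all(survives_round_trip(m, i, j) for m, i, j in trials))
```

A random search of this kind cannot replace the proof, but it is a quick sanity check on the hypotheses, for instance that $|m|$ stays below $2^{52}$.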

#### Proof

Assume without loss of generality that m > 0. Let $q = m \oslash n$. Scaling by powers of two, we can consider an equivalent setting in which $2^{52} \le m < 2^{53}$ and likewise for q, so that both m and q are integers whose least significant bits occupy the units place (i.e., ulp(m) = ulp(q) = 1). Before scaling, we assumed $m < 2^{52}$, so after scaling, m is an even integer. Also, because the scaled values of m and q satisfy m/2 < q < 2m, the corresponding value of n must have one of two forms depending on which of m or q is larger: if q < m, then evidently 1 < n < 2, and since n is a sum of two powers of two, $n = 1 + 2^{-k}$ for some k; similarly, if q > m, then 1/2 < n < 1, so $n = 1/2 + 2^{-(k+1)}$. (As n is the sum of two powers of two, the closest possible value of n to one is $n = 1 + 2^{-52}$. Because $m/(1 + 2^{-52})$ is no larger than the next smaller double precision number less than m, we can’t have q = m.)

Let e denote the rounding error in computing q, so that q = m/n + e, and the computed value $q \otimes n$ will be the (once or twice) rounded value of m + ne. Consider first the case in which each floating-point operation is rounded correctly to double precision. In this case, |e| < 1/2. If n has the form $1/2 + 2^{-(k+1)}$, then ne = nq – m is an integer multiple of $2^{-(k+1)}$ and $|ne| < 1/4 + 2^{-(k+2)}$. This implies that $|ne| \le 1/4$. Recall that the difference between m and the next larger representable number is 1 and the difference between m and the next smaller representable number is either 1 if $m > 2^{52}$ or 1/2 if $m = 2^{52}$. Thus, as $|ne| \le 1/4$, m + ne will round to m. (Even if $m = 2^{52}$ and ne = –1/4, the product will round to m by the round-ties-to-even rule.) Similarly, if n has the form $1 + 2^{-k}$, then ne is an integer multiple of $2^{-k}$ and $|ne| < 1/2 + 2^{-(k+1)}$; this implies $|ne| \le 1/2$. We can’t have $m = 2^{52}$ in this case because m is strictly greater than q, so m differs from its nearest representable neighbors by ±1. Thus, as $|ne| \le 1/2$, again m + ne will round to m. (Even if |ne| = 1/2, the product will round to m by the round-ties-to-even rule because m is even.) This completes the proof for correctly rounded arithmetic.

In double-rounding arithmetic, it may still happen that q is the correctly rounded quotient (even though it was actually rounded twice), so |e| < 1/2 as above. In this case, we can appeal to the arguments of the previous paragraph provided we consider the fact that $q \otimes n$ will be rounded twice. To account for this, note that the IEEE standard requires that an extended double format carry at least 64 significant bits, so that the numbers m ± 1/2 and m ± 1/4 are exactly representable in extended double precision. Thus, if n has the form $1/2 + 2^{-(k+1)}$, so that $|ne| \le 1/4$, then rounding m + ne to extended double precision must produce a result that differs from m by at most 1/4, and as noted above, this value will round to m in double precision. Similarly, if n has the form $1 + 2^{-k}$, so that $|ne| \le 1/2$, then rounding m + ne to extended double precision must produce a result that differs from m by at most 1/2, and this value will round to m in double precision. (Recall that $m > 2^{52}$ in this case.)

Finally, we are left to consider cases in which q is not the correctly rounded quotient due to double-rounding. In these cases, we have $|e| < 1/2 + 2^{-(d+1)}$ in the worst case, where d is the number of extra bits in the extended double format. (All existing extended-based systems support an extended double format with exactly 64 significant bits; for this format, d = 64 – 53 = 11.) Because double-rounding only produces an incorrectly rounded result when the second rounding is determined by the round-ties-to-even rule, q must be an even integer. Thus if n has the form $1/2 + 2^{-(k+1)}$, then ne = nq – m is an integer multiple of $2^{-k}$, and

$$|ne| < \left(1/2 + 2^{-(k+1)}\right)\left(1/2 + 2^{-(d+1)}\right) = 1/4 + 2^{-(k+2)} + 2^{-(d+2)} + 2^{-(k+d+2)}.$$

If $k \le d$, this implies $|ne| \le 1/4$. If k > d, we have $|ne| \le 1/4 + 2^{-(d+2)}$. In either case, the first rounding of the product will deliver a result that differs from m by at most 1/4, and by previous arguments, the second rounding will round to m. Similarly, if n has the form $1 + 2^{-k}$, then ne is an integer multiple of $2^{-(k-1)}$, and

$$|ne| < 1/2 + 2^{-(k+1)} + 2^{-(d+1)} + 2^{-(k+d+1)}.$$

If $k \le d$, this implies $|ne| \le 1/2$. If k > d, we have $|ne| \le 1/2 + 2^{-(d+1)}$. In either case, the first rounding of the product will deliver a result that differs from m by at most 1/2, and again by previous arguments, the second rounding will round to m.

The preceding proof shows that the product can incur double-rounding only if the quotient does, and even then, it rounds to the correct result. The proof also shows that extending our reasoning to include the possibility of double-rounding can be challenging even for a program with only two floating-point operations. For a more complicated program, it may be impossible to systematically account for the effects of double-rounding, not to mention more general combinations of double and extended double precision computations.
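Theorem 7′ can also be spot-checked numerically. The following C sketch is a sanity test rather than a substitute for the proof, and it assumes that values assigned to `double` variables really are rounded to double precision (for example, `FLT_EVAL_METHOD` is 0). The helper names `exact_pow2` and `theorem7prime_holds` are illustrative.

```c
/* Returns 2^e as a double by repeated doubling or halving, each step of
   which is exact while the result stays in the normal range. */
static double exact_pow2(int e)
{
    double p = 1.0;
    for (; e > 0; e--)
        p *= 2.0;
    for (; e < 0; e++)
        p *= 0.5;
    return p;
}

/* Returns 1 if (m / n) * n == m for n = 2^i + 2^j, and 0 otherwise.
   The caller is expected to pass an integer m with |m| < 2^52 and
   exponents with |i - j| <= 52, so that n is exactly representable. */
int theorem7prime_holds(double m, int i, int j)
{
    double n = exact_pow2(i) + exact_pow2(j);
    double q = m / n;          /* rounded quotient */
    double r = q * n;          /* rounded product  */
    return r == m;
}
```

Looping over a range of integers `m` and exponents `i` and `j` and checking that the function always returns 1 gives some empirical confidence in the statement. A divisor that is not a sum of two powers of two is outside the theorem, so the function says nothing about such cases.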

### Programming Language Support for Extended Precision

The preceding examples should not be taken to suggest that extended precision per se is harmful. Many programs can benefit from extended precision when the programmer is able to use it selectively. Unfortunately, current programming languages do not provide sufficient means for a programmer to specify when and how extended precision should be used. To indicate what support is needed, we consider the ways in which we might want to manage the use of extended precision.

In a portable program that uses double precision as its nominal working precision, there are five ways we might want to control the use of a wider precision:

1. Compile to produce the fastest code, using extended precision where possible on extended-based systems. Clearly most numerical software does not require more of the arithmetic than that the relative error in each operation is bounded by the “machine epsilon”. When data in memory are stored in double precision, the machine epsilon is usually taken to be the largest relative roundoff error in that precision, since the input data are (rightly or wrongly) assumed to have been rounded when they were entered and the results will likewise be rounded when they are stored. Thus, while computing some of the intermediate results in extended precision may yield a more accurate result, extended precision is not essential. In this case, we might prefer that the compiler use extended precision only when it will not appreciably slow the program and use double precision otherwise.
2. Use a format wider than double if it is reasonably fast and wide enough, otherwise resort to something else. Some computations can be performed more easily when extended precision is available, but they can also be carried out in double precision with only somewhat greater effort. Consider computing the Euclidean norm of a vector of double precision numbers. By computing the squares of the elements and accumulating their sum in an IEEE 754 extended double format with its wider exponent range, we can trivially avoid premature underflow or overflow for vectors of practical lengths. On extended-based systems, this is the fastest way to compute the norm. On single/double systems, an extended double format would have to be emulated in software (if one were supported at all), and such emulation would be much slower than simply using double precision, testing the exception flags to determine whether underflow or overflow occurred, and if so, repeating the computation with explicit scaling. Note that to support this use of extended precision, a language must provide both an indication of the widest available format that is reasonably fast, so that a program can choose which method to use, and environmental parameters that indicate the precision and range of each format, so that the program can verify that the widest fast format is wide enough (e.g., that it has wider range than double).
3. Use a format wider than double even if it has to be emulated in software. For more complicated programs than the Euclidean norm example, the programmer may simply wish to avoid the need to write two versions of the program and instead rely on extended precision even if it is slow. Again, the language must provide environmental parameters so that the program can determine the range and precision of the widest available format.
4. Don’t use a wider precision; round results correctly to the precision of the double format, albeit possibly with extended range. For programs that are most easily written to depend on correctly rounded double precision arithmetic, including some of the examples mentioned above, a language must provide a way for the programmer to indicate that extended precision must not be used, even though intermediate results may be computed in registers with a wider exponent range than double. (Intermediate results computed in this way can still incur double-rounding if they underflow when stored to memory: if the result of an arithmetic operation is rounded first to 53 significant bits, then rounded again to fewer significant bits when it must be denormalized, the final result may differ from what would have been obtained by rounding just once to a denormalized number. Of course, this form of double-rounding is highly unlikely to affect any practical program adversely.)
5. Round results correctly to both the precision and range of the double format. This strict enforcement of double precision would be most useful for programs that test either numerical software or the arithmetic itself near the limits of both the range and precision of the double format. Such careful test programs tend to be difficult to write in a portable way; they become even more difficult (and error prone) when they must employ dummy subroutines and other tricks to force results to be rounded to a particular format. Thus, a programmer using an extended-based system to develop robust software that must be portable to all IEEE 754 implementations would quickly come to appreciate being able to emulate the arithmetic of single/double systems without extraordinary effort.

No current language supports all five of these options. In fact, few languages have attempted to give the programmer the ability to control the use of extended precision at all. One notable exception is the ISO/IEC 9899:1999 Programming Languages – C standard, the latest revision to the C language, which is now in the final stages of standardization.

The C99 standard allows an implementation to evaluate expressions in a format wider than that normally associated with their type, but the C99 standard recommends using one of only three expression evaluation methods. The three recommended methods are characterized by the extent to which expressions are “promoted” to wider formats, and the implementation is encouraged to identify which method it uses by defining the preprocessor macro `FLT_EVAL_METHOD`: if `FLT_EVAL_METHOD` is 0, each expression is evaluated in a format that corresponds to its type; if `FLT_EVAL_METHOD` is 1, `float` expressions are promoted to the format that corresponds to `double`; and if `FLT_EVAL_METHOD` is 2, `float` and `double` expressions are promoted to the format that corresponds to `long double`. (An implementation is allowed to set `FLT_EVAL_METHOD` to -1 to indicate that the expression evaluation method is indeterminable.) The C99 standard also requires that the `<math.h>` header file define the types `float_t` and `double_t`, which are at least as wide as `float` and `double`, respectively, and are intended to match the types used to evaluate `float` and `double` expressions. For example, if `FLT_EVAL_METHOD` is 2, both `float_t` and `double_t` are `long double`. Finally, the C99 standard requires that the `<float.h>` header file define preprocessor macros that specify the range and precision of the formats corresponding to each floating-point type.
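A program can inspect these facilities directly. The following C sketch reports the evaluation method and the widths of `float_t` and `double_t` using only the macros and types named above; the function names `eval_method_description` and `print_fp_environment` are illustrative.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

/* Maps a FLT_EVAL_METHOD value to the description given in the text. */
const char *eval_method_description(int method)
{
    switch (method) {
    case 0:  return "each expression evaluated in the format of its type";
    case 1:  return "float expressions promoted to double";
    case 2:  return "float and double expressions promoted to long double";
    case -1: return "indeterminable";
    default: return "implementation-defined";
    }
}

/* Prints the evaluation method, the sizes of the evaluation types,
   and the number of significand digits (base FLT_RADIX) of each format. */
void print_fp_environment(void)
{
    printf("FLT_EVAL_METHOD = %d (%s)\n", (int)FLT_EVAL_METHOD,
           eval_method_description((int)FLT_EVAL_METHOD));
    printf("sizeof(float_t) = %zu, sizeof(double_t) = %zu\n",
           sizeof(float_t), sizeof(double_t));
    printf("significand digits: float %d, double %d, long double %d\n",
           FLT_MANT_DIG, DBL_MANT_DIG, LDBL_MANT_DIG);
}
```

The output differs between systems, which is the point: the same source can be compiled under different evaluation methods, and only these macros tell the program which one it got.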

The combination of features required or recommended by the C99 standard supports some of the five options listed above but not all. For example, if an implementation maps the `long double` type to an extended double format and defines `FLT_EVAL_METHOD` to be 2, the programmer can reasonably assume that extended precision is relatively fast, so programs like the Euclidean norm example can simply use intermediate variables of type `long double` (or `double_t`). On the other hand, the same implementation must keep anonymous expressions in extended precision even when they are stored in memory (e.g., when the compiler must spill floating-point registers), and it must store the results of expressions assigned to variables declared `double` to convert them to double precision even if they could have been kept in registers. Thus, neither the `double` nor the `double_t` type can be compiled to produce the fastest code on current extended-based hardware.

Likewise, the C99 standard provides solutions to some of the problems illustrated by the examples in this section but not all. A C99 standard version of the `log1p` function is guaranteed to work correctly if the expression `1.0 + x` is assigned to a variable (of any type) and that variable is used throughout. A portable, efficient C99 standard program for splitting a double precision number into high and low parts, however, is more difficult: how can we split at the correct position and avoid double-rounding if we cannot guarantee that `double` expressions are rounded correctly to double precision? One solution is to use the `double_t` type to perform the splitting in double precision on single/double systems and in extended precision on extended-based systems, so that in either case the arithmetic will be correctly rounded. Theorem 14 says that we can split at any bit position provided we know the precision of the underlying arithmetic, and the `FLT_EVAL_METHOD` and environmental parameter macros should give us this information.

The following fragment shows one possible implementation:

```c
#include <math.h>   /* double_t, scalbn */
#include <float.h>  /* FLT_EVAL_METHOD, DBL_MANT_DIG, LDBL_MANT_DIG */

/* PWR2 is the bit position at which to split, chosen from the
   precision in which double expressions are actually evaluated. */
#if (FLT_EVAL_METHOD==2)
#define PWR2 (LDBL_MANT_DIG - (DBL_MANT_DIG/2))
#elif ((FLT_EVAL_METHOD==1) || (FLT_EVAL_METHOD==0))
#define PWR2 (DBL_MANT_DIG - (DBL_MANT_DIG/2))
#else
#error FLT_EVAL_METHOD unknown!
#endif

...
    double   x, xh, xl;
    double_t m;

    m = scalbn(1.0, PWR2) + 1.0;  // 2**PWR2 + 1
    xh = (m * x) - ((m * x) - x); // high part of x
    xl = x - xh;                  // low part; xh + xl == x exactly
```

Of course, to find this solution, the programmer must know that `double` expressions may be evaluated in extended precision, that the ensuing double-rounding problem can cause the algorithm to malfunction, and that extended precision may be used instead according to Theorem 14. A more obvious solution is simply to specify that each expression be rounded correctly to double precision. On extended-based systems, this merely requires changing the rounding precision mode, but unfortunately, the C99 standard does not provide a portable way to do this. (Early drafts of the Floating-Point C Edits, the working document that specified the changes to be made to the C90 standard to support floating-point, recommended that implementations on systems with rounding precision modes provide `fegetprec` and `fesetprec` functions to get and set the rounding precision, analogous to the `fegetround` and `fesetround` functions that get and set the rounding direction. This recommendation was removed before the changes were made to the C99 standard.)

Coincidentally, the C99 standard’s approach to supporting portability among systems with different integer arithmetic capabilities suggests a better way to support different floating-point architectures. Each C99 standard implementation supplies an `<stdint.h>` header file that defines those integer types the implementation supports, named according to their sizes and efficiency: for example, `int32_t` is an integer type exactly 32 bits wide, `int_fast16_t` is the implementation’s fastest integer type at least 16 bits wide, and `intmax_t` is the widest integer type supported. One can imagine a similar scheme for floating-point types: for example, `float53_t` could name a floating-point type with exactly 53 bit precision but possibly wider range, `float_fast24_t` could name the implementation’s fastest type with at least 24 bit precision, and `floatmax_t` could name the widest reasonably fast type supported. The fast types could allow compilers on extended-based systems to generate the fastest possible code subject only to the constraint that the values of named variables must not appear to change as a result of register spilling. The exact width types would cause compilers on extended-based systems to set the rounding precision mode to round to the specified precision, allowing wider range subject to the same constraint. Finally, `double_t` could name a type with both the precision and range of the IEEE 754 double format, providing strict double evaluation. Together with environmental parameter macros named accordingly, such a scheme would readily support all five options described above and allow programmers to indicate easily and unambiguously the floating-point semantics their programs require.
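Part of such a scheme can be approximated with what C99 already offers. The following header-style sketch is purely hypothetical: no standard header defines `float_fast24_t` or `floatmax_t`, and `float_fast53_t` and `FLOATMAX_MANT_DIG` are additional names invented here for illustration. It also assumes IEEE 754 formats, so that `float` carries 24 significant bits and `double` carries 53.

```c
#include <float.h>
#include <math.h>

/* HYPOTHETICAL names following the scheme described above. Only the
   "fast" and "max" types can be approximated in C99. An exact-precision
   type such as float53_t would need control of the rounding precision
   mode, which C99 does not expose, so it is deliberately omitted. */
typedef float_t  float_fast24_t;   /* efficient type, at least float   */
typedef double_t float_fast53_t;   /* efficient type, at least double  */

#if FLT_EVAL_METHOD == 2
/* Assumption: promotion to long double implies it is reasonably fast. */
typedef long double floatmax_t;
#define FLOATMAX_MANT_DIG LDBL_MANT_DIG
#else
typedef double floatmax_t;
#define FLOATMAX_MANT_DIG DBL_MANT_DIG
#endif
```

The gap is as instructive as the typedefs: the exact width types are precisely the part that cannot be written portably today.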

Must language support for extended precision be so complicated? On single/double systems, four of the five options listed above coincide, and there is no need to differentiate fast and exact width types. Extended-based systems, however, pose difficult choices: they support neither pure double precision nor pure extended precision computation as efficiently as a mixture of the two, and different programs call for different mixtures. Moreover, the choice of when to use extended precision should not be left to compiler writers, who are often tempted by benchmarks (and sometimes told outright by numerical analysts) to regard floating-point arithmetic as “inherently inexact” and therefore neither deserving nor capable of the predictability of integer arithmetic. Instead, the choice must be presented to programmers, and they will require languages capable of expressing their selection.

### Conclusion

The foregoing remarks are not intended to disparage extended-based systems but to expose several fallacies, the first being that all IEEE 754 systems must deliver identical results for the same program. We have focused on differences between extended-based systems and single/double systems, but there are further differences among systems within each of these families. For example, some single/double systems provide a single instruction to multiply two numbers and add a third with just one final rounding. This operation, called a fused multiply-add, can cause the same program to produce different results across different single/double systems, and, like extended precision, it can even cause the same program to produce different results on the same system depending on whether and when it is used. (A fused multiply-add can also foil the splitting process of Theorem 6, although it can be used in a non-portable way to perform multiple precision multiplication without the need for splitting.) Even though the IEEE standard didn’t anticipate such an operation, it nevertheless conforms: the intermediate product is delivered to a “destination” beyond the user’s control that is wide enough to hold it exactly, and the final sum is rounded correctly to fit its single or double precision destination.

The idea that IEEE 754 prescribes precisely the result a given program must deliver is nonetheless appealing. Many programmers like to believe that they can understand the behavior of a program and prove that it will work correctly without reference to the compiler that compiles it or the computer that runs it. In many ways, supporting this belief is a worthwhile goal for the designers of computer systems and programming languages. Unfortunately, when it comes to floating-point arithmetic, the goal is virtually impossible to achieve. The authors of the IEEE standards knew that, and they didn’t attempt to achieve it. As a result, despite nearly universal conformance to (most of) the IEEE 754 standard throughout the computer industry, programmers of portable software must continue to cope with unpredictable floating-point arithmetic.

If programmers are to exploit the features of IEEE 754, they will need programming languages that make floating-point arithmetic predictable. The C99 standard improves predictability to some degree at the expense of requiring programmers to write multiple versions of their programs, one for each `FLT_EVAL_METHOD`. Whether future languages will choose instead to allow programmers to write a single program with syntax that unambiguously expresses the extent to which it depends on IEEE 754 semantics remains to be seen. Existing extended-based systems threaten that prospect by tempting us to assume that the compiler and the hardware can know better than the programmer how a computation should be performed on a given system. That assumption is the second fallacy: the accuracy required in a computed result depends not on the machine that produces it but only on the conclusions that will be drawn from it, and of the programmer, the compiler, and the hardware, at best only the programmer can know what those conclusions may be.

1 Examples of other representations are floating slash and signed logarithm [Matula and Kornerup 1985; Swartzlander and Alexopoulos 1975].
2 This term was introduced by Forsythe and Moler [1967], and has generally replaced the older term mantissa.
3 This assumes the usual arrangement where the exponent is stored to the left of the significand.
4 Unless the number z is larger than $\beta^{e_{\max}+1}$ or smaller than $\beta^{e_{\min}}$. Numbers which are out of range in this fashion will not be considered until further notice.
5 Let z′ be the floating-point number that approximates z. Then $|d.d \ldots d - (z/\beta^e)|\,\beta^{p-1}$ is equivalent to $|z' - z|/\mathrm{ulp}(z')$. A more accurate formula for measuring error is $|z' - z|/\mathrm{ulp}(z)$. – Ed.
6 700, not 70. Since .1 – .0292 = .0708, the error in terms of ulp(0.0292) is 708 ulps. – Ed.
7 Although the expression (x – y)(x + y) does not cause a catastrophic cancellation, it is slightly less accurate than $x^2 - y^2$ if $x \gg y$ or $x \ll y$. In this case, (x – y)(x + y) has three rounding errors, but $x^2 - y^2$ has only two since the rounding error committed when computing the smaller of $x^2$ and $y^2$ does not affect the final subtraction.
8 Also commonly referred to as correctly rounded. – Ed.
9 When n = 845, $x_n$ = 9.45, $x_n$ + 0.555 = 10.0, and 10.0 – 0.555 = 9.45. Therefore, $x_n = x_{845}$ for n > 845.
10 Notice that in binary, q cannot equal  . – Ed.
11 Left as an exercise to the reader: extend the proof to bases other than 2. – Ed.
12 This appears to have first been published by Goldberg [1967], although Knuth ([1981], page 211) attributes this idea to Konrad Zuse.
13 According to Kahan, extended precision has 64 bits of significand because that was the widest precision across which carry propagation could be done on the Intel 8087 without increasing the cycle time [Kahan 1988].
14 Some arguments against including inner product as one of the basic operations are presented by Kahan and LeBlanc [1985].
15 Kirchner writes: It is possible to compute inner products to within 1 ulp in hardware in one partial product per clock cycle. The additionally needed hardware compares to the multiplier array needed anyway for that speed.
16 CORDIC is an acronym for Coordinate Rotation Digital Computer and is a method of computing transcendental functions that uses mostly shifts and adds (i.e., very few multiplications and divisions) [Walther 1971]. It is the method used on both the Intel 8087 and the Motorola 68881.
17 Fine point: Although the default in IEEE arithmetic is to round overflowed numbers to ∞, it is possible to change the default (see Rounding Modes).
18 They are called subnormal in 854, denormal in 754.
19 This is the cause of one of the most troublesome aspects of the standard. Programs that frequently underflow often run noticeably slower on hardware that uses software traps.
20 No invalid exception is raised unless a “trapping” NaN is involved in the operation. See section 6.2 of IEEE Std 754-1985. – Ed.
21  may be greater than  if both x and y are negative. – Ed.
22 It can be in range because if x < 1, n < 0 and $x^{-n}$ is just a tiny bit smaller than the underflow threshold $2^{e_{\min}}$, then $x^n \approx 2^{-e_{\min}} < 2^{e_{\max}}$, and so may not overflow, since in all IEEE precisions, $-e_{\min} < e_{\max}$.
23 This is probably because designers like “orthogonal” instruction sets, where the precisions of a floating-point instruction are independent of the actual operation. Making a special case for multiplication destroys this orthogonality.
24 This assumes the common convention that `3.0` is a single-precision constant, while `3.0D0` is a double precision constant.
25 The conclusion that $0^0 = 1$ depends on the restriction that f be nonconstant. If this restriction is removed, then letting f be the identically 0 function gives 0 as a possible value for $\lim_{x \to 0} f(x)^{g(x)}$, and so $0^0$ would have to be defined to be a NaN.
26 In the case of $0^0$, plausibility arguments can be made, but the convincing argument is found in “Concrete Mathematics” by Graham, Knuth and Patashnik, and argues that $0^0 = 1$ for the binomial theorem to work. – Ed.
27 Unless the rounding mode is round toward –∞, in which case x – x = –0.
28 The VMS math libraries on the VAX use a weak form of in-line procedure substitution, in that they use the inexpensive jump to subroutine call rather than the slower `CALLS` and `CALLG` instructions.
29 The difficulty with presubstitution is that it requires either direct hardware implementation, or continuable floating-point traps if implemented in software. – Ed.
30 In this informal proof, assume that β = 2 so that multiplication by 4 is exact and doesn’t require a $\delta_i$.
31 This is the sum if adding w does not generate carry out. Additional argument is needed for the special case where adding w does generate carry out. – Ed.
32 Rounding gives kx + wk – r only if (kx + wk) keeps the form of kx. – Ed.