Saturday, June 25, 2011

Embracing non-determinism

Computers are supposed to be deterministic. This is often the case for single processor machines. However, as you scale up, guaranteeing determinism becomes increasingly expensive.

Even on single processor machines you are facing non-determinism on semi-regular basis. Here are some examples

  • Bugs + poor OS memory control that allows programs to read uninitialized memory. A recent example for me was when I used FontForge to process a large set of files, and had it crash on a different set every time I ran it.
  • /dev/random gives "truly random" numbers, however, even if the program uses regular rand() seeded by time, it's essentially non-deterministic since so many factors affect the time of seed()
  • Floating point operations on modern architecture are non-deterministic because of run-time optimization -- depending on the pattern of memory allocation, processor will rearrange the order of your arithmetic operations, producing slightly different results on reruns
  • Soft-errors in circuits due to radiation, about one error per day, according to stats here
  • Some CPU units are bad and produce incorrect results. I don't know exact stats on this, but the guy who did the Petasort experiment told me that it was one of the issues they had to deal with.

Those modes of failure may be rare on one computer, but once you scale your computation up to many cores, it becomes a fact of life.

In a distributed application like sorting a petabyte of data, you should expect about 1 hardrive failure, 100 irrecoverable disk read errors, and a couple of undetected memory errors if you use error correcting RAM (more if you use regular RAM).

You can make your computation deterministic by adding checksums at each stage and recomputing the parts that fail. However, this is expensive, and becomes disporportionately more expensive as you scale up. When the Czajkowski et al sorted 100 trillion records deterministically, most of the time was spent on error recovery.

I've been running an aggregation that is modest in comparison, going over over about 100 TB of data and getting slightly different results each time. Theoretically I could add recovery stages that get rid of non-determinism, but at a cost that is not justified in my case.

Large scale analysis systems like Dremel are configured to produce approximate results by default, otherwise, the long tail of packet arrival times means that you'll spent most of the time waiting for a few "stragglers"

Given the increasing cost of determinism it makes sense to think of it in terms of cost-benefit analysis and trade off the benefit of reproducible results against the cost of 100% error recovery.

A related idea come up in Vikash Mansinghka presentation at NIPS 2010 workshop on monte carlo methods: Real world decision problems can be formulated as inference and solved with MCMC sampling, so it doesn't make sense to spend so much effort ensuring determinism only to destroy it with /dev/random -- you could achieve greater efficiency by using analogue circuits to start with. His company, Navia, is working in this direction.

There's some more discussion on determinism on Daniel Lemire's blog


Itman said...

I agree. It is not only hardware errors. It is also the order of execution. Perhaps, most non-determinism comes from multi-threaded applications.

Anonymous said...

Get access to all available offers from paypal for free and it all works. Redeem free paypal cash and get many more. Good and very informative. Love to share such blogs with all my friends same as the website which is menat to be shared free paypal money adder as it is a good public welfare work to do, this mentioned website has free paypal money and gift cards to offers. its real and working, Visit today.

Unknown said...

nice blog
android training in bangalore
ios training in bangalore
machine learning online training

Unknown said...

useful blog
python interview questions
cognos interview questions
perl interview questions
vlsi interview questions
web api interview questions
msbi interview questions

Unknown said...

laravel interview questions
aem interview questions
salesforce interview questions
oops abab interview questions
itil interview questions
informatica interview questions
extjs interview questions

Unknown said...

sap bi interview questions
hive interview questions
seo interview questions
as400 interview questions
wordpress interview questions
accounting interview questions
basic accounting and financial interview questions

katetech said...

Thank you for your post. This is excellent information. It is amazing and wonderful to visit your site.
digital marketing internship in vijayawada
project internships in vijayawada

pythonV said...

amazing content really informational !!

check our site too :pytholabs

basha said...

Excellent blog I visit this blog it's really awesome. The important thing is that in this blog content written clearly and understandable. The content of information is very informative.
oracle fusion financials classroom training
Workday HCM Online Training
Oracle Fusion Financials Online Training
Oracle Fusion HCM Online Training
Oracle Fusion SCM Online Training
Oracle Fusion HCM Classroom Training

Yunita Puspita said...

RaksasaPoker Agen Domino Server Terbaru dan BandarQ Terpercaya Indonesia
Website Judi Online Yang memberikan Pelayanan Terbaik dan mempunyai Server Yang Berkualitas Di Dalam Permainan Asli tanpa Robot Ataupun Admin Ikut Serta Bermain Serta Memberikan Kemudahan Dalam Melakukan Proses Transaksi Deposit Dan Wirthdaw Hanya 2 Menit Dana Anda Pasti Sudah Di Proses.




Agen Poker

Agen Domino99

Agen Capsa

Gabung Sekarang Juga Dan Dapatkan Ratusan Juta Bersama Kamihttps://akhsadew.blogspot.comAC

remo said...

Digital Marketing Internship Program in BangaloreDigital Marketing Internship Program in Bangalore

draj said...

Excellent machine learning blog,thanks for sharing...
Seo Internship in Bangalore
Smo Internship in Bangalore
Digital Marketing Internship Program in Bangalore

Anonymous said...

Hello, o you know that the best way to boost your brain is
visiting or contacting us

Pila Brass Knuckles Online for sale online,

where to buy Buy Pila Brass Knuckles Online

buy valley online cartridge

buy space candy online

buy cannabis syrup online

buy botox online

cannabis bread

uk chese

47 dank vapewhite fire og

buy moonrock

Email Us

Contact: +1 619-537-6734 said...

really awesome post
thanks for sharing

playapkpro said...

In a distributed application like sorting a petabyte of data, you should expect about 1 hard drive failure, 100 irrecoverable disk read errors, and a couple of undetected memory errors if you use error-correcting RAM (more if you use regular RAM).
that type of error is really hit on chats apps or software on the device