Editor’s note: Imported from my old personal blog @ TC with minor edits to improve readability where necessary.
You may have heard me say, only half jokingly at best, “I don’t know Perl, I know Combat Perl”. When it comes to C I am even less confident. In that arena I’m prone to say, “I don’t know C, I know Windmill C Technique”. Picture the grade school kid whose only move against the playground bully is to closes his eyes, start his arms spinning wildly perpendicular to the ground like a human mowing machine while he nervously approaches his target. That’s me writing C code. It was with some trepidation that I even post this blog entry, making all Windmill C coders look good.
In a recent battle involving a pcap parser I needed to implement a routine that “sampled” packets to help ease the processing load for the user when encountering large pcap files. A packet should be selected at random using a configurable predefined rate. The random process needn’t be cryptographically secure, and I wasn’t going to worry about bias already present in the order of packets in the pcap file, but my code ought to be statistically random enough to obtain a good cross section of packets in a typical pcap file. For example, instead of processing every single packet in a pcap file, the user should be able to process approximately one in every 100 packets, a reduction by two orders of magnitude.
My job seemed simple enough. First, setup a command line option to accept a sampling rate. Then when reading packets serially, perform a test using a random number with the given sampling rate to determine if the current packet should be processed or not. At the start of a packet processing routine, I implemented something that looks like this:
if ( sample_flag > 0 && rand() % sample_rate > 0 ) {
return;
}
sample_flag
is essentially a boolean. sample_rate
is the
user-defined sampling rate, which might be some sane integer value n
where 1 < n < 2^31 - 1
(RAND_MAX
in my stdlib.h
is 2^31 - 1
).
Upon reaching the if
conditional test, the rand()
function returns a
value modulo the sample rate if sampling is enabled. This value would
be between zero and the sample rate minus one, inclusive. If the result
of the modulo operation is not equal to zero, the packet is skipped,
otherwise, only when the modulo result is zero will the packet be
processed. Statistically speaking, the value zero will appear, or a
packet will be processed, approximately one in every sample_rate
times. This random sampling should be sufficient for my task. Again,
it may not be cryptographically secure and it doesn’t account for any
bias in the packet capture itself, but it should be reasonably random
across distinct runs of any pcap file.
Not so fast. When I tested this sampling code using any pcap file, the
same exact sequence of random numbers was generated each time through.
In other words, the so-called random sequence of numbers was the same
sequence for each incarnation of the application. That certainly isn’t
what I expected. In order to use the rand()
function effectively you
must properly seed or initialize the random number generator using the
srand()
function. To my chagrin, I did this and was still getting
what should have been a practically impossible duplication of random
number sequences. I initialized the random number generator in the same
fashion as one recently suggested in a Stack Overflow answer to “How to
‘randomize()’ random numbers in
C(Linux)?”:
FILE *urandom_fp = fopen( "/dev/urandom", "r" );
if ( urandom_fp == NULL ) {
perror( "/dev/urandom" );
exit(EXIT_FAILURE);
}
unsigned int seed;
int result = fread( &seed, sizeof(int), 1, urandom_fp );
if ( result != 1 ) {
fclose(urandom_fp);
fprintf( stderr, "/dev/urandom read error\n" );
exit(EXIT_FAILURE);
}
if ( fclose(urandom_fp) == EOF ) {
perror( "/dev/urandom" );
exit(EXIT_FAILURE):
}
srand(seed);
Now again, this may not be the most cryptographically secure way to
initialize the random number generator, but it ought to be more than
adequate for my purposes. I was reading an integer’s worth of data for
the seed from /dev/urandom
, a special non-blocking file
that provides pseudo-random data and is found on the Unix system I’m
working on. The seed essentially determines the sequence of the random
numbers that rand()
generates. If you don’t seed the
random number generator, the default seed value is 1. Unless you
want perfectly predictable random numbers, which is usually only done in
testing and measurement situations, you typically want to seed the
random number generator with as random a number as you can get. Hence
my call to read from urandom
. Yet, while it looked like
what I was doing should produce reasonably decent random number
sequences across application invocations, I was getting entirely
predictable sequences each time.
Recall that this coder codes using the Windmill Technique so it wasn’t
immediately obvious how I had seemingly implemented this sampling
process incorrectly. After a few minutes of head scratching and
rewriting the sampling routine in different ways to debug what was going
on, it finally dawned on me to to see if srand()
was being
called somewhere else, after my own call to use urandom
for
the seed. I happened to be borrowing some code from the Bloom filter C
library in the
Bloom::Faster
Perl module on CPAN. It didn’t take me long to discover that
bloom.c
contained the following:
srand(CONS);
and in bloom.h
this:
#define CONS 1234567
D’oh! Since I was calling the code in this Bloom filter library after I
called srand()
myself, the random number generator was
being reinitialized using the static seed of 1234567
instead of the better, pseudo-random seed from urandom
for
each invocation of the application. I immediately removed the call to
srand()
from the Bloom library code I was using and got the
randomness I originally expected.
For a fleeting moment I considered that perhaps I’m not as bad a C programmer as I thought. That idea quickly faded as I proceeded to Windmill my way through the rest of the application I was building. To be fair, the Bloom filter library I was adapting didn’t require as good randomization as I needed and maybe the original implementation wanted to achieve some measure of reproducibility. Regardless, in the future I’m likely to be more cognizant of how my applications’ random numbers are being generated even when at first glance they appear to have been seeded effectively.
For a good introduction to generating random data with pointers to other sources of information, including the challenges in obtaining random data for cryptographic applications, see Ferguson and Schneier’s Practical Cryptography chapter entitled Generating Randomness. Note, I realize there is an updated version of the book with a new name, I just don’t have it. The new version, Cryptography Engineering, appears to have retained the chapter with presumably equal or better content.
On a related note, sadly Google Code Search is reportedly shutting down
on January 15,
2012. If
you’re reading this before then, you might want to hurry up and search
for code to see how others have called
srand()
.
If you do, you will notice that not all seeds are created equal. Is
there code that calls rand()
, but never srand()
? Can you find
predictable sequences that lead to unintended consequences?