Code: Watch out for someone else's srand()

Editor’s note: Imported from my old personal blog @ TC with minor edits to improve readability where necessary.

You may have heard me say, only half jokingly at best, “I don’t know Perl, I know Combat Perl”. When it comes to C I am even less confident. In that arena I’m prone to say, “I don’t know C, I know Windmill C Technique”. Picture the grade school kid whose only move against the playground bully is to closes his eyes, start his arms spinning wildly perpendicular to the ground like a human mowing machine while he nervously approaches his target. That’s me writing C code. It was with some trepidation that I even post this blog entry, making all Windmill C coders look good.

In a recent battle involving a pcap parser I needed to implement a routine that “sampled” packets to help ease the processing load for the user when encountering large pcap files. A packet should be selected at random using a configurable predefined rate. The random process needn’t be cryptographically secure, and I wasn’t going to worry about bias already present in the order of packets in the pcap file, but my code ought to be statistically random enough to obtain a good cross section of packets in a typical pcap file. For example, instead of processing every single packet in a pcap file, the user should be able to process approximately one in every 100 packets, a reduction by two orders of magnitude.

My job seemed simple enough. First, setup a command line option to accept a sampling rate. Then when reading packets serially, perform a test using a random number with the given sampling rate to determine if the current packet should be processed or not. At the start of a packet processing routine, I implemented something that looks like this:

if ( sample_flag > 0 && rand() % sample_rate > 0 ) {
    return;
}

sample_flag is essentially a boolean. sample_rate is the user-defined sampling rate, which might be some sane integer value n where 1 < n < 2^31 - 1 (RAND_MAX in my stdlib.h is 2^31 - 1). Upon reaching the if conditional test, the rand() function returns a value modulo the sample rate if sampling is enabled. This value would be between zero and the sample rate minus one, inclusive. If the result of the modulo operation is not equal to zero, the packet is skipped, otherwise, only when the modulo result is zero will the packet be processed. Statistically speaking, the value zero will appear, or a packet will be processed, approximately one in every sample_rate times. This random sampling should be sufficient for my task. Again, it may not be cryptographically secure and it doesn’t account for any bias in the packet capture itself, but it should be reasonably random across distinct runs of any pcap file.

Not so fast. When I tested this sampling code using any pcap file, the same exact sequence of random numbers was generated each time through. In other words, the so-called random sequence of numbers was the same sequence for each incarnation of the application. That certainly isn’t what I expected. In order to use the rand() function effectively you must properly seed or initialize the random number generator using the srand() function. To my chagrin, I did this and was still getting what should have been a practically impossible duplication of random number sequences. I initialized the random number generator in the same fashion as one recently suggested in a Stack Overflow answer to “How to ‘randomize()’ random numbers in C(Linux)?":

FILE *urandom_fp = fopen( "/dev/urandom", "r" );
if ( urandom_fp == NULL ) {
    perror( "/dev/urandom" );
    exit(EXIT_FAILURE);
}

unsigned int seed;
int result = fread( &seed, sizeof(int), 1, urandom_fp );
if ( result != 1 ) {
    fclose(urandom_fp);
    fprintf( stderr, "/dev/urandom read error\n" );
    exit(EXIT_FAILURE);
}

if ( fclose(urandom_fp) == EOF ) {
    perror( "/dev/urandom" );
    exit(EXIT_FAILURE):
}

srand(seed);

Now again, this may not be the most cryptographically secure way to initialize the random number generator, but it ought to be more than adequate for my purposes. I was reading an integer’s worth of data for the seed from /dev/urandom, a special non-blocking file that provides pseudo-random data and is found on the Unix system I’m working on. The seed essentially determines the sequence of the random numbers that rand() generates. If you don’t seed the random number generator, the default seed value is 1. Unless you want perfectly predictable random numbers, which is usually only done in testing and measurement situations, you typically want to seed the random number generator with as random a number as you can get. Hence my call to read from urandom. Yet, while it looked like what I was doing should produce reasonably decent random number sequences across application invocations, I was getting entirely predictable sequences each time.

Recall that this coder codes using the Windmill Technique so it wasn’t immediately obvious how I had seemingly implemented this sampling process incorrectly. After a few minutes of head scratching and rewriting the sampling routine in different ways to debug what was going on, it finally dawned on me to to see if srand() was being called somewhere else, after my own call to use urandom for the seed. I happened to be borrowing some code from the Bloom filter C library in the Bloom::Faster Perl module on CPAN. It didn’t take me long to discover that bloom.c contained the following:

srand(CONS);

and in bloom.h this:

#define CONS 1234567

D’oh! Since I was calling the code in this Bloom filter library after I called srand() myself, the random number generator was being reinitialized using the static seed of 1234567 instead of the better, pseudo-random seed from urandom for each invocation of the application. I immediately removed the call to srand() from the Bloom library code I was using and got the randomness I originally expected.

For a fleeting moment I considered that perhaps I’m not as bad a C programmer as I thought. That idea quickly faded as I proceeded to Windmill my way through the rest of the application I was building. To be fair, the Bloom filter library I was adapting didn’t require as good randomization as I needed and maybe the original implementation wanted to achieve some measure of reproducibility. Regardless, in the future I’m likely to be more cognizant of how my applications' random numbers are being generated even when at first glance they appear to have been seeded effectively.

For a good introduction to generating random data with pointers to other sources of information, including the challenges in obtaining random data for cryptographic applications, see Ferguson and Schneier’s Practical Cryptography chapter entitled Generating Randomness. Note, I realize there is an updated version of the book with a new name, I just don’t have it. The new version, Cryptography Engineering, appears to have retained the chapter with presumably equal or better content.

On a related note, sadly Google Code Search is reportedly shutting down on January 15, 2012. If you’re reading this before then, you might want to hurry up and search for code to see how others have called srand(). If you do, you will notice that not all seeds are created equal. Is there code that calls rand(), but never srand()? Can you find predictable sequences that lead to unintended consequences?