Testing methodology
The types of hosts I selected for the survey are probably not a statistically meaningful sample. I simply wanted a broad range of interesting, high-profile systems, which I think I got. However, I simply don't know enough statistics to do a proper statistical survey - that's not what this paper is about, although I'd very much like to see such a project done.
(Note: most of this subsection is an attempt to explain my thought processes; other than the last paragraph, which describes how I finally selected a method for choosing semi-random systems, it has very little bearing on the overall survey results.)
I wanted a standard against which to compare the results of my surveyed hosts, and at first I thought of comparing them to relatively random hosts using a list of personal web pages from Yahoo. I planned to use the web sites of the people listed under some letter for my sample. The letter "S" had over three thousand (!) names, so I simply sifted the names and selected 500 random sites. (I chose to scan approximately 500 random hosts so that I could do some sort of statistically meaningful comparison with the 1700 other sites I surveyed.)
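Picking 500 entries at random from a listing of a few thousand names is a one-liner in most languages. A minimal sketch in Python (rather than the perl I actually used; the names below are stand-ins, not real Yahoo data):

```python
import random

# Stand-in for the ~3,000 personal pages listed under "S"
# (illustrative names only, not the actual Yahoo listing).
listing = ["site%04d.example.com" % i for i in range(3000)]

# Draw 500 distinct entries, each equally likely, without replacement.
survey_sample = random.sample(listing, 500)
```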
There were several problems with this approach (people who know anything about statistics can kindly refrain from laughing.) One of the most significant problems was that large sites (like, say, AOL) were over-represented - IF I wanted a truly random sampling. Another problem was that I didn't know if I wanted random web sites or simply random sites. After I talked to a statistics pal of mine, I decided that I'd better start over.
So I then decided to try a really random set of hosts. I thought: well, what better way than to generate a series of random numbers spanning all possible IP numbers? I generated four numbers, each basically from 1-254, strung them together into an address, and then pinged it to get a reasonably random IP address (modulo the poor quality of the perl random number generator.) I quickly realized that even if there were approximately 5 million active IP addresses, it would take about 1,000 tries to find one that was active. At a few seconds per trial, that works out to about 42 minutes to find one host. Even in parallel, I wanted faster results.
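The generate-and-ping idea can be sketched as follows (Python here rather than the original perl; the ping flags are an assumption - Linux-style - and the probe is illustrative only):

```python
import random
import subprocess

def random_ip():
    # Each octet drawn from 1-254, as in the original approach.
    # (This skips 0 and 255 and makes no attempt to avoid
    # reserved or private ranges.)
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def is_alive(ip, timeout_s=2):
    # Hypothetical liveness probe: one ICMP echo request with a
    # short timeout.  The -c/-W flags are Linux ping(8) options;
    # other systems differ.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        capture_output=True)
    return result.returncode == 0

# Back-of-the-envelope: 254**4 is roughly 4.2 billion addresses;
# with ~5 million of them live, that's about 4.2e9 / 5e6, i.e. on
# the order of 1,000 tries per hit - at a few seconds per ping,
# roughly 42 minutes per live host found.
```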
So in the end I sacrificed randomness for expedience, and used the Network Wizards host tables (which are a few months old), pseudo-randomly (again, using perl's rand() function) selecting from the 500 megabytes of data there. For those interested, here is my program to get a random number of hosts out of hostfile. I am not a statistician, as you can tell. But I did gather approximately 500 hosts (469 out of the 500; see the source to see why it didn't get all 500), and I feel confident that they are as random as reasonably possible; indeed, they are significantly more random than the other hosts in the survey (given Yahoo's method of listing sites).
SATAN did most of what I wanted for this survey. Specifically, it:
The other useful tools for this sort of work are:
Keep in mind that no hosts were actually broken into. However, there are many ways to detect potential problems without actually breaking into a system. Some of these problems are due to ignorance or host or network misconfigurations, but many of them are taken from CERT advisories. Here are the specific problems looked at:
It took quite a long time to get all the results, partly due to a general lack of preparation on my part. SATAN didn't quite do all the tests that I wanted to run against the survey hosts, so I had to rerun bits and pieces of it several times. And, while no hosts were broken into while performing the survey, the techniques used could easily have been mistaken for an attack. The survey would leave entries in the survey host's audit records; certainly anyone running one of the SATAN detectors (such as courtney) or the very popular and effective TCP wrappers would have detected some fairly suspicious activity as well.
Despite all of this, I only received 3 pieces of e-mail because of the survey - two from the main survey sites (that's a .12 percent response rate), who also e-mailed my ISP, and one from a host in the random sampling (a .2 percent rate), who CC'd CERT in the e-mail. They were all initially suspicious, fearing a failed attack, but they calmed down after I apologized for the infraction. One of them accepted my offer to further scan his site for potential problems (there were none), and the other gave me constructive comments on a rough draft of this paper. It's possible that other response teams were contacted and I know nothing about them, or that other investigations were spurred by the survey; I sincerely hope that this is not the case and that people were not unduly alarmed by the survey.
There were some additional signs that the survey was detected, but they were very sparse. Two sites out of the 1700 main survey hosts (that's about .1 percent!) fingered the host I was running the survey from (tsunami.trouble.org). Interestingly, four of the 469 random hosts (0.85 percent - that's zero-point-eight-five percent, not eighty-five percent!) probed me back - still an almost imperceptible rate, but an order of magnitude more than the main survey sites.
The survey was presumably detected more often than this; unfortunately, there is no way to determine just how frequently it was discovered.
But that was it - there were no other signs that my survey was detected or that it caused any alarm. And, since trouble.org is still a very new site, with no advertised services and having only myself as a user, it was easy to spot any probes directed at my site that were a result of the survey by comparing the probes with the SATAN scan logs.