2009-05-27

Regex for matching email addresses, and some benchmarking

Today I was trying to help a colleague with a regular expression (regex) for matching the most common email formats (not quite RFC 2822, as most of the exceptions it allows are almost never seen).

My colleague came up with this expression:

^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$

This seems to work well, except for the rare addresses that have a + in the local part.
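A quick way to see that limitation, as a minimal sketch: the two addresses are made-up examples, and $original is just a name picked here for the expression above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The colleague's expression, precompiled (the variable name is illustrative).
my $original = qr/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/;

# Hypothetical test addresses; only the second has a + in the local part.
for my $address ('balle.klorin@supperaadet.not', 'balle+list@supperaadet.not') {
    printf "%-30s %s\n", $address, ($address =~ $original ? 'match' : 'no match');
}
```

The first address should match and the second should not, since + is not in any of the character classes used for the local part.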

My suggestion was this:

^[\w\.+-]+@(?:[a-z0-9-]{2,}\.)+[a-z0-9]{2,4}$
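To get a feel for what this expression accepts and rejects, here is a small sketch. The addresses are invented for illustration, and $optimized is only a name chosen here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The suggested expression, precompiled (the variable name is illustrative).
my $optimized = qr/^[\w\.+-]+@(?:[a-z0-9-]{2,}\.)+[a-z0-9]{2,4}$/;

# Hypothetical addresses, each probing one property of the pattern.
my @addresses = (
    'balle+list@supperaadet.not',    # + in the local part
    'Balle.Klorin@supperaadet.not',  # uppercase in the local part (\w allows it)
    'balle@x.not',                   # one-character domain label
    'balle@supperaadet.museum',      # top-level domain longer than four characters
);

for my $address (@addresses) {
    printf "%-30s %s\n", $address, ($address =~ $optimized ? 'match' : 'no match');
}
```

The first two addresses should match and the last two should not: every domain label needs at least two characters, and the final label is limited to two to four. Whether those restrictions are acceptable depends on the addresses you expect to see.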

I wondered if mine was more efficient, and used Perl's Benchmark module to try to find out.

My script looks like this:

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw(:all);

# Supply the email address to benchmark as an argument
my $email = $ARGV[0] or die "Usage: $0 email-address\n";

# Number of times each regex is run against the address
my $count = 1_000_000;

my $results = timethese($count, {
    'Optimized' => sub { $email =~ /^[\w\.+-]+@(?:[a-z0-9\-]{2,}\.)+[a-z0-9]{2,4}$/ },
    'Original'  => sub { $email =~ /^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/ },
});

# Print a chart comparing the rates of the two versions
cmpthese($results);

I first tried what I think is the most common format, localpart@domain:

$ ./mailre.pl ballek@supperaadet.not
Benchmark: timing 1000000 iterations of Optimized, Original...
Optimized: 2 wallclock secs ( 1.50 usr + 0.01 sys = 1.51 CPU) @ 662251.66/s (n=1000000)
Original: 3 wallclock secs ( 2.88 usr + 0.03 sys = 2.91 CPU) @ 343642.61/s (n=1000000)
              Rate  Original Optimized
Original  343643/s        --      -48%
Optimized 662252/s       93%        --

Then the heavily used firstname.lastname@domain:

$ ./mailre.pl balle.klorin@supperaadet.not
Benchmark: timing 1000000 iterations of Optimized, Original...
Optimized: 1 wallclock secs ( 1.36 usr + 0.00 sys = 1.36 CPU) @ 735294.12/s (n=1000000)
Original: 4 wallclock secs ( 2.85 usr + 0.01 sys = 2.86 CPU) @ 349650.35/s (n=1000000)
              Rate  Original Optimized
Original  349650/s        --      -52%
Optimized 735294/s      110%        --

Finally, a format used by some MTAs for local aliasing (qmail uses - by default, while Postfix and Exim are commonly configured to use +):

$ ./mailre.pl balle.klorin-list@supperaadet.not
Benchmark: timing 1000000 iterations of Optimized, Original...
Optimized: 0 wallclock secs ( 1.44 usr + 0.00 sys = 1.44 CPU) @ 694444.44/s (n=1000000)
Original: 2 wallclock secs ( 2.97 usr + 0.00 sys = 2.97 CPU) @ 336700.34/s (n=1000000)
              Rate  Original Optimized
Original  336700/s        --      -52%
Optimized 694444/s      106%        --

I was not surprised that my version was faster. The real-world impact of a slow regex might not be that big (1 million tests took about 3 seconds with the unoptimized version), and for this problem, matching email addresses, the product will not become noticeably slower. On the other hand, if you write something that parses millions of lines on a frequent basis, you should try to optimize your regex.
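For that kind of workload it is probably more telling to benchmark against a sample of the real input than against a single address. A rough sketch of how that could look: the @sample list and the count_matches helper are made up for illustration, and the iteration count is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical sample; in practice this would be read from real input.
my @sample = (
    'ballek@supperaadet.not',
    'balle.klorin@supperaadet.not',
    'balle.klorin-list@supperaadet.not',
    'not an email address',
);

my $optimized = qr/^[\w\.+-]+@(?:[a-z0-9-]{2,}\.)+[a-z0-9]{2,4}$/;
my $original  = qr/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/;

# Illustrative helper: count how many sample entries a pattern matches.
sub count_matches {
    my ($pattern) = @_;
    return scalar grep { $_ =~ $pattern } @sample;
}

# 'none' suppresses the per-test report; cmpthese then prints the comparison chart.
my $results = timethese(100_000, {
    'Optimized' => sub { count_matches($optimized) },
    'Original'  => sub { count_matches($original) },
}, 'none');

cmpthese($results);
```

The helper call and the grep add the same overhead to both sides, so the percentages will likely look smaller than in the single-address runs. Including a line that does not match is deliberate, since failed matches are part of any real input.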
