re::engine::Plugin - API to write custom regex engines

As of perl 5.9.5 it's possible to lexically replace perl's built-in
regular expression engine with your own (see perlreapi and perlpragma).
This module provides a glue interface to the relevant parts of the perl
C API, enabling you to write an engine in Perl instead of the C/XS
interface provided by the core.
Each regex in perl is compiled into an internal "REGEXP" structure (see
perlreapi). This can happen at compile time for patterns of the form
"/pattern/", at run time for "qr//" patterns, or somewhere in between,
depending on variable interpolation etc.

When this module is loaded into a scope it inserts a hook into
$^H{regcomp} (as described in perlreapi and perlpragma) so that each
regexp constructed in its lexical scope is handled by this engine. It
differs from other engines in that it also inserts other hooks into
"%^H" in the same scope; these point to user-defined subroutines to use
during compilation, execution, etc. They are described in "CALLBACKS".

The callbacks (e.g. "comp") then get called with a re::engine::Plugin
object as their first argument. This object provides access to perl's
internal REGEXP struct in addition to its own state (e.g. a stash). The
methods on this object allow for altering the "REGEXP" struct's internal
state, adding new callbacks, etc.
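
The lexical scoping described above relies on the general "%^H" hints
mechanism documented in perlpragma. The following is a minimal sketch of
that mechanism only, using stock perl; the "MyHint" package and its
"MyHint/enabled" key are hypothetical names chosen for illustration and
are not part of re::engine::Plugin.

```perl
use strict;
use warnings;

# Hypothetical pragma, for illustration only (not part of re::engine::Plugin).
package MyHint;

# Called at compile time; records a hint in the scope being compiled.
sub import   { $^H{'MyHint/enabled'} = 1 }
sub unimport { $^H{'MyHint/enabled'} = 0 }

# Report whether the hint is set in the caller's lexical scope.
sub in_effect {
    my $hints = (caller(0))[10];
    return $hints && $hints->{'MyHint/enabled'} ? 1 : 0;
}

package main;

my @seen;
push @seen, MyHint::in_effect();        # outside the scope
{
    BEGIN { MyHint->import }            # what "use MyHint;" would do
    push @seen, MyHint::in_effect();    # inside the scope
}
push @seen, MyHint::in_effect();        # outside again

print "@seen\n";
```

The hint is visible only inside the block where it was set, which is the
same property that lets a custom regex engine be confined to one lexical
scope.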
Callbacks are specified in the "re::engine::Plugin" import list as
key-value pairs of names and subroutine references:

    use re::engine::Plugin (

To write a custom engine which imports your functions into the caller's
scope, use the following snippet:

    package re::engine::Example;
    use re::engine::Plugin ();

        # Sets the caller's $^H{regcomp} and fills their %^H with our callbacks
        re::engine::Plugin->import(

    *unimport = \&re::engine::Plugin::unimport;

    # Implementation of the engine

        # return value discarded
Called when a regex is compiled by perl. This is always the first
callback to be called, and it may be called multiple times or not at all
depending on what perl sees fit at the time.

The first argument will be a freshly constructed "re::engine::Plugin"
object (think of it as $self) which you can interact with using the
methods below. This object will be passed around the other callbacks and
methods for the lifetime of the regex.

Calling "die" or anything that uses it (such as "croak") here will not be
trapped by an "eval" block that the pattern is in, i.e.

    use re::engine::Plugin(
            croak "Your pattern is invalid"
                unless $rx->pattern =~ /pony/;

    # Ignores the eval block
    eval { /you die in C<eval>, you die for real/ };

This happens because the real subroutine call happens indirectly at
compile time and not in the scope of the "eval" block. This is how
perl's own engine would behave in the same situation if given an invalid
pattern such as "/(/".
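
The comparison with perl's own engine can be checked without this module.
The sketch below uses only stock perl: a constant invalid pattern fails
while the surrounding code is being compiled, so an "eval" block around
it never gets to run, whereas an interpolated pattern is compiled at run
time and its failure is trapped. The variable names are illustrative.

```perl
use strict;
use warnings;

# A constant pattern is compiled together with the code around it, so the
# inner eval BLOCK never runs; only the outer string eval sees the error.
my $ok = eval q{ eval { "pony" =~ /(/ }; 1 };
my $compile_error = $@;

# An interpolated pattern is compiled at run time, inside the eval block,
# so the block traps the failure.
my $pat = '(';
my $trapped = eval { "pony" =~ /$pat/; 1 } ? 0 : 1;
my $runtime_error = $@;

print "string eval failed at compile time\n" if !$ok && $compile_error;
print "eval block trapped the run-time error\n" if $trapped && $runtime_error;
```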
    use re::engine::Plugin(
            # We always like ponies!
            if ($str =~ /pony/) {

Called when a regex is being executed, i.e. when it's being matched
against something. The scalar being matched against the pattern is
available as the second argument ($str) and through the str method. The
routine should return a true value if the match was successful, and a
false one if it wasn't.

This callback can also be specified on an individual basis with the
    use re::engine::Plugin(
            say 'matched ' . ($ponies // 'no')
                . ' pon' . ($ponies > 1 ? 'ies' : 'y');

Called when the regexp structure is freed by the perl interpreter. Note
that this happens pretty late in the destruction process, but still
before global destruction kicks in. The only argument this callback
receives is the "re::engine::Plugin" object associated with the regexp,
and its return value is ignored.

This callback can also be specified on an individual basis with the
    # in comp/exec/methods:

The last scalar to be matched against the pattern or "undef" if there
hasn't been a match yet.

perl's own engine always stringifies the scalar being matched against a
given pattern, however a custom engine need not have such restrictions.
One could write an engine that matched a file handle against a pattern or
any other complex data structure.
The pattern that the engine was asked to compile. This can be either a
classic Perl pattern with modifiers like "/pat/ix" or "qr/pat/ix", or an
arbitrary scalar. The latter allows for passing anything that doesn't fit
in a string and five modifier characters, such as hashrefs, objects,

    say "has /ix" if $mod{i} and $mod{x};

A key-value pair list of the modifiers the pattern was compiled with.
The keys will be zero or more of "imsxp" and the values will be true
values (so that you don't have to write "exists").

You don't get to know if the "eogc" modifiers were attached to the
pattern since these are internal to perl and shouldn't matter to regexp
engines.
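
Because the modifier list is a plain key-value list with true values, it
can be loaded into a hash and tested directly. The sketch below stands in
a hand-written hash for the list the object would return, so that it runs
on stock perl; the contents of "%mod" are assumed values for a pattern
compiled with "/ix".

```perl
use strict;
use warnings;

# Stand-in for the modifier list of a pattern compiled with /ix
# (assumed values; a real engine would get this list from the object).
my %mod = (i => 1, x => 1);

# Values are true, so a plain boolean test is enough; no exists() needed.
my @active = grep { $mod{$_} } qw(i m s x p);
my $has_ix = ($mod{i} and $mod{x}) ? 1 : 0;

print "modifiers in effect: @active\n";
print "has /ix\n" if $has_ix;
```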
    comp => sub { shift->stash( [ 1 .. 5 ] ) },
    exec => sub { shift->stash }, # Get [ 1 .. 5 ]

Returns or sets a user-defined stash that's passed around as part of the
$rx object, useful for passing around all sorts of data between the
callback routines and methods.
    my $minlen = $rx->minlen // "not set";

The minimum "length" a string must be to match the pattern. perl will
use this internally during matching to check whether the stringified
form of the string (or other object) being matched is at least this
long; if not, the regexp engine in effect (that means you!) will not be
called.

The length specified will be used as a byte length (using SvPV), not a
character length.
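
The difference between the two kinds of length matters as soon as the
string contains characters that take more than one byte. The following
sketch uses only stock perl and assumes the string is held as UTF-8, as
perl does internally for wide characters; it shows the two counts
diverging for the same string.

```perl
use strict;
use warnings;

# "pony" followed by U+263A, a character that needs three bytes in UTF-8.
my $str = "pony\x{263A}";

my $chars = length $str;    # length() counts characters here

# Encode a copy to UTF-8 octets so that length() counts bytes instead.
my $octets = $str;
utf8::encode($octets);
my $bytes = length $octets;

print "$chars characters, $bytes bytes\n";
```

A minimum length chosen with characters in mind would therefore be too
small a bound in bytes for such strings, though still a safe one, since
the byte length is never smaller than the character length.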
    # A dumb regexp engine that just tests string equality
    use re::engine::Plugin comp => sub {
            my $pat = $re->pattern;

Takes a list of key-value pairs of names and subroutines, and replaces
the callback currently attached to the regular expression for the type
given as the key with the code reference passed as the corresponding
value.

The only valid keys are currently "exec" and "free". See "exec" and
"free" for more details about these callbacks.
            my ($re, $paren) = @_;
            my ($re, $paren, $rhs) = @_;

            # return value discarded
            my ($re, $paren) = @_;

Takes a list of key-value pairs of names and subroutines that implement
numbered capture variables. "FETCH" will be called on value retrieval
("say $1"), "STORE" on assignment ("$1 = "ook"") and "LENGTH" on "length
$1".

The second parameter of each routine is the paren number being
requested/stored. The following mapping applies for those numbers:

    -2 => $` or ${^PREMATCH}
    -1 => $' or ${^POSTMATCH}

Assignment to capture variables makes it possible to implement something
like Perl 6 ":rw" semantics, and since it's possible to make the capture
variables return any scalar instead of just a string it becomes possible
to implement Perl 6 match object semantics (to name an example).
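
To make the paren numbering concrete, here is a sketch of what a "FETCH"
routine might compute, written against a match performed by perl's own
engine so that it runs without this module. The helper "fetch_paren" and
its arguments are hypothetical; the treatment of 0 as the whole match and
of positive numbers as $1, $2, ... follows perlreapi.

```perl
use strict;
use warnings;

# Hypothetical helper: map a paren number to a piece of the matched string,
# given the start and end offsets of each capture group.
sub fetch_paren {
    my ($str, $starts, $ends, $paren) = @_;

    return substr($str, 0, $starts->[0]) if $paren == -2;    # $`
    return substr($str, $ends->[0])      if $paren == -1;    # $'

    # 0 is the whole match ($&), 1 and up are $1, $2, ...
    return undef unless defined $starts->[$paren];
    return substr($str, $starts->[$paren],
                  $ends->[$paren] - $starts->[$paren]);
}

my $str = "my pony runs";
$str =~ /p(on)y/ or die "no match";

# Copy the offsets recorded by perl's engine for the last match.
my @starts = @-;
my @ends   = @+;

my @got = map { fetch_paren($str, \@starts, \@ends, $_) } -2 .. 1;
print join("|", @got), "\n";
```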
perl's internals still need to be changed to support this, but when it's
done it'll allow the binding of "%+" and "%-" and support the Tie::Hash
methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, NEXTKEY and

True iff the module could have been built with thread-safety features
enabled.

True iff this module could have been built with fork-safety features
enabled. This will always be true except on Windows where it's false for
perl 5.10.0 and below.
The only way to untaint an existing variable in Perl is to use it as a
hash key or to reference subpatterns from a regular expression match (see
perlsec). The latter only works in perl's regex engine because it
explicitly untaints capture variables, which a custom engine will also
need to do if it wants its capture variables to be untainted.

There are basically two ways to go about this. The first and obvious one
is to make use of Perl's lexical scoping, which enables the use of its
built-in regex engine in the scope of the overriding engine's callbacks:

    use re::engine::Plugin (
            my ($re, $str) = @_; # $str is tainted
            my ($re, $paren) = @_;

            # This is perl's engine doing the match
            # $1 has been untainted

The second is to use something like Taint::Util, which flips the taint
flag on the scalar without invoking perl's regex engine:

    use re::engine::Plugin (
            my ($re, $str) = @_; # $str is tainted
            my ($re, $paren) = @_;

            # Copy $str and untaint the copy
            untaint(my $ret = $str);

            # Return the untainted value

In either case a regex engine using perl's regex API or this module is
responsible for how and if it untaints its variables.
perlreapi, Taint::Util
* Engines implemented with this module don't support "s///" and "split
  //", the appropriate parts of the "REGEXP" struct need to be wrapped

* Still not a complete wrapper for perlreapi in other ways, needs
  methods for some "REGEXP" struct members, some callbacks aren't

* Support overloading operations on the "qr//" object; this would allow
  control over "qr//" objects in a manner that isn't limited by

        '@{}' => sub { ... },

* Support the dispatch of arbitrary methods from the re::engine::Plugin
  qr// object to user-defined subroutines via AUTOLOAD:

        package re::engine::Plugin;
            my ($name) = $AUTOLOAD =~ /.*::(.*)/;
            my $cv = getmeth($name); # or something like that

        package re::engine::SomeEngine;
            $re->add_method( # or something like that
                    my ($re, @arg) = @_; # re::engine::Plugin, 1..5

        use re::engine::SomeEngine;
        $re->foshizzle(1..5);

* Implement the dupe callback, test this on a threaded perl (and learn
  how to use threads and how they break the current model).

* Allow the user to specify ->offs either as an array or a packed
  string. Can pack() even pack I32? Only IV? int?

* Add tests that check for different behavior when curpm is and is not

* Add tests that check the refcount of the stash and other things I'm
  mucking with, run valgrind and make sure everything is destroyed

* Run the debugger on the testsuite and find cases when the intuit and
  checkstr callbacks are called. Write wrappers around them and add
A C compiler. This module may happen to build with a C++ compiler as
well, but don't rely on it, as no guarantee is made in this regard.

XSLoader (standard since perl 5.6.0).

Please report any bugs that aren't already listed at
<http://rt.cpan.org/Dist/Display.html?Queue=re-engine-Plugin> to
<http://rt.cpan.org/Public/Bug/Report.html?Queue=re-engine-Plugin>

Ævar Arnfjörð Bjarmason "<avar at cpan.org>"

Vincent Pit "<perl at profvince.com>"

Copyright 2007,2008 Ævar Arnfjörð Bjarmason.

Copyright 2009,2010,2011,2013,2014,2015 Vincent Pit.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.