re::engine::Plugin - API to write custom regex engines

As of perl 5.9.5 it's possible to lexically replace perl's built-in
regular expression engine with your own (see perlreapi and perlpragma).
This module provides a glue interface to the relevant parts of the perl
C API, enabling you to write an engine in Perl instead of the C/XS
interface provided by the core.
Each regex in perl is compiled into an internal "REGEXP" structure (see
perlreapi). This can happen at compile time for patterns written as
"/pattern/", at run time for "qr//" patterns, or somewhere in between,
depending on variable interpolation and the like.
When this module is loaded into a scope it inserts a hook into
$^H{regcomp} (as described in perlreapi and perlpragma) so that each
regexp constructed in its lexical scope is handled by this engine. It
differs from other engines in that it also inserts other hooks into
"%^H" in the same scope. These point to user-defined subroutines to use
during compilation, execution, etc., and are described in "CALLBACKS".
The callbacks (e.g. "comp") then get called with a re::engine::Plugin
object as their first argument. This object provides access to perl's
internal "REGEXP" struct in addition to its own state (e.g. a stash).
The methods on this object allow for altering the "REGEXP" struct's
internal state, adding new callbacks, etc.
Callbacks are specified in the "re::engine::Plugin" import list as
key-value pairs of names and subroutine references:

    use re::engine::Plugin (
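
A minimal sketch of a complete import list, assuming the module is
installed from CPAN. The @log package variable and the messages pushed
onto it are illustrative only and are not part of the module:

```perl
use strict;
use warnings;

# Illustrative package variable recording which callbacks ran.
# It is deliberately not initialised here: comp runs at compile time,
# before any run-time assignment would execute.
our @log;

use re::engine::Plugin (
    # Runs when perl compiles a pattern in this lexical scope.
    comp => sub {
        my ($rx) = @_;
        push @log, 'comp: ' . $rx->pattern;
    },
    # Runs when a pattern is matched; a true return value means "matched".
    exec => sub {
        my ($rx, $str) = @_;
        push @log, "exec: $str";
        return 1;
    },
);

my $matched = "some text" =~ /some pattern/ ? 1 : 0;
print "$_\n" for @log;
```

Every pattern that follows the "use" line in the same lexical scope is
handed to these two subroutines instead of perl's own engine.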
To write a custom engine which imports your functions into the caller's
scope, use the following snippet:

    package re::engine::Example;
    use re::engine::Plugin ();

        # Sets the caller's $^H{regcomp} and populates his %^H with our callbacks
        re::engine::Plugin->import(

    *unimport = \&re::engine::Plugin::unimport;

    # Implementation of the engine
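
One way such a wrapper package could look in full is sketched below.
The helper names "_comp" and "_exec" and the prefix-matching rule are
assumptions made for illustration; only the "import"/"unimport"
arrangement comes from the snippet above:

```perl
package re::engine::Example;

use strict;
use warnings;

use re::engine::Plugin ();

sub import {
    # Forward to re::engine::Plugin so that the hints land in the scope
    # that said "use re::engine::Example".
    re::engine::Plugin->import(
        comp => \&_comp,
        exec => \&_exec,
    );
}

# Reuse the stock unimport so "no re::engine::Example" switches back.
*unimport = \&re::engine::Plugin::unimport;

# Illustrative engine: a pattern matches any string that starts with it.
sub _comp {
    my ($rx) = @_;
    $rx->stash($rx->pattern);    # remember the literal prefix
    return;                      # the return value is discarded
}

sub _exec {
    my ($rx, $str) = @_;
    my $prefix = $rx->stash;
    return substr($str, 0, length $prefix) eq $prefix;
}

1;
```

A script could then enable the engine with "use re::engine::Example;"
and switch back to perl's engine with "no re::engine::Example;".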
        # return value discarded

Called when a regex is compiled by perl. This is always the first
callback to be called, and it may be called multiple times or not at
all, depending on what perl sees fit at the time.

The first argument will be a freshly constructed "re::engine::Plugin"
object (think of it as $self) which you can interact with using the
methods below. This object will be passed around the other callbacks
and methods for the lifetime of the regex.
Calling "die" or anything that uses it (such as "croak") here will not
be trapped by an "eval" block that the pattern is in, e.g.

    use re::engine::Plugin(
            croak "Your pattern is invalid"
                unless $rx->pattern =~ /pony/;

    # Ignores the eval block
    eval { /you die in C<eval>, you die for real/ };
This happens because the real subroutine call happens indirectly at
compile time and not in the scope of the "eval" block. This is how
perl's own engine would behave in the same situation if given an invalid
pattern such as "/(/".
        # We always like ponies!
        return 1 if $str =~ /pony/;

Called when a regex is being executed, i.e. when it's being matched
against something. The scalar being matched against the pattern is
available as the second argument ($str) and through the "str" method.
The routine should return a true value if the match was successful, and
a false one if it wasn't.
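
As a small sketch of the true/false contract, the "exec" callback below
treats the pattern as a literal substring instead of a regular
expression. The substring rule and the variable names are illustrative
assumptions, not something the module prescribes:

```perl
use strict;
use warnings;

use re::engine::Plugin (
    exec => sub {
        my ($rx, $str) = @_;
        # Treat the pattern as a literal substring rather than as a regex.
        return index($str, $rx->pattern) >= 0;
    },
);

# With perl's own engine both strings would match /a.b/; with the
# literal engine only the string that contains "a.b" verbatim should.
my $literal_hit  = "a.b" =~ /a.b/ ? "match" : "no match";
my $literal_miss = "axb" =~ /a.b/ ? "match" : "no match";
print "a.b: $literal_hit\naxb: $literal_miss\n";
```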
    # in comp/exec/methods:

The last scalar to be matched against the pattern or "undef" if there
hasn't been a match yet.

perl's own engine always stringifies the scalar being matched against a
given pattern; a custom engine, however, need not have such
restrictions. One could write an engine that matched a file handle
against a pattern or any other complex data structure.
The pattern that the engine was asked to compile. This can be either a
classic Perl pattern with modifiers like "/pat/ix" or "qr/pat/ix" or an
arbitrary scalar. The latter allows for passing anything that doesn't
fit in a string and five modifier characters, such as hashrefs, objects,
etc.

    say "has /ix" if $mod{i} and $mod{x};

A key-value pair list of the modifiers the pattern was compiled with.
The keys will be zero or more of "imsxp" and the values will be true
values (so that you don't have to write "exists").
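
The following sketch shows one way to inspect the modifiers from a
"comp" callback. The %seen package variable and the two sample patterns
are illustrative assumptions:

```perl
use strict;
use warnings;

# Illustrative package variable that records what comp saw per pattern.
our %seen;

use re::engine::Plugin (
    comp => sub {
        my ($rx) = @_;
        my %mod = $rx->mod;
        # Values are true, so a plain boolean test is enough.
        $seen{ $rx->pattern } = ($mod{i} && $mod{x}) ? "has /ix" : "no /ix";
    },
    exec => sub { 1 },
);

my $with_flags    = "anything" =~ /flagged/ix;
my $without_flags = "anything" =~ /plain/;
print "$_: $seen{$_}\n" for sort keys %seen;
```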
You don't get to know if the "eogc" modifiers were attached to the
pattern since these are internal to perl and shouldn't matter to regexp
engines.
    comp => sub { shift->stash( [ 1 .. 5 ] ) },
    exec => sub { shift->stash }, # Get [ 1 .. 5 ]

Returns or sets a user-defined stash that's passed around as part of the
$rx object. It is useful for passing all sorts of data between the
callback routines and methods.
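
A typical use of the stash is to prepare data in "comp" and reuse it in
"exec". In the sketch below the pattern is read as a comma-separated
list of allowed words; that rule and the variable names are assumptions
chosen only to keep the example short:

```perl
use strict;
use warnings;

use re::engine::Plugin (
    comp => sub {
        my ($rx) = @_;
        # Illustrative rule: the pattern is a comma-separated word list.
        # The split below is compiled by perl's own engine, because this
        # callback is compiled before the plugin engine takes effect.
        my %allowed = map { $_ => 1 } split /,/, $rx->pattern;
        $rx->stash(\%allowed);    # built when the pattern is compiled
    },
    exec => sub {
        my ($rx, $str) = @_;
        return exists $rx->stash->{$str};    # reuse the stored lookup table
    },
);

my $known   = "green" =~ /red,green,blue/ ? 1 : 0;
my $unknown = "pink"  =~ /red,green,blue/ ? 1 : 0;
print "green: $known, pink: $unknown\n";
```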
    my $minlen = $rx->minlen // "not set";

The minimum "length" a string must be to match the pattern. perl will
use this internally during matching to check whether the stringified
form of the string (or other object) being matched is at least this
long; if not, the regexp engine in effect (that means you!) will not be
called at all.

The length specified will be used as a byte length (using SvPV), not a
character length.
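
To illustrate the short-circuit described above, this sketch sets a
minimum length of 3 in "comp" and counts how often "exec" runs. The
$exec_calls counter is an illustrative assumption, and the sketch
assumes "minlen" accepts the new value as its argument:

```perl
use strict;
use warnings;

# Illustrative counter of how often the exec callback actually ran.
our $exec_calls;

use re::engine::Plugin (
    comp => sub {
        my ($rx) = @_;
        $rx->minlen(3);    # strings shorter than 3 bytes cannot match
    },
    exec => sub {
        $exec_calls++;
        return 1;          # accept everything that reaches the engine
    },
);

# "ab" is too short, so perl should reject it without calling exec.
my $short = "ab"  =~ /anything/ ? 1 : 0;
my $long  = "abc" =~ /anything/ ? 1 : 0;
print "short=$short long=$long exec_calls=", $exec_calls // 0, "\n";
```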
        my ($re, $paren) = @_;
        my ($re, $paren, $rhs) = @_;

        # return value discarded
        my ($re, $paren) = @_;

Takes a list of key-value pairs of names and subroutines that implement
numbered capture variables. "FETCH" will be called on value retrieval
("say $1"), "STORE" on assignment ("$1 = "ook"") and "LENGTH" on
"length $1".
The second parameter of each routine is the paren number being
requested/stored. The following mapping applies for those numbers:

    -2 => $` or ${^PREMATCH}
    -1 => $' or ${^POSTMATCH}

Assignment to capture variables makes it possible to implement something
like Perl 6 ":rw" semantics. Since it's also possible to make the
capture variables return any scalar instead of just a string, it becomes
possible to implement Perl 6 match object semantics (to name an
example).
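
A hedged sketch of a "FETCH" callback follows. The whitespace-splitting
rule is an assumption made for illustration, and only positive paren
numbers are handled; a real engine would also want to cover the special
numbers in the mapping above:

```perl
use strict;
use warnings;

use re::engine::Plugin (
    exec => sub {
        my ($rx, $str) = @_;

        # Illustrative matching rule: split on whitespace, keep the fields.
        my @fields = split ' ', $str;
        $rx->stash(\@fields);

        $rx->num_captures(
            FETCH => sub {
                my ($rx, $paren) = @_;
                # $1, $2, ... map to the stored fields; zero and negative
                # paren numbers are not handled in this sketch.
                return $paren >= 1 ? $rx->stash->[ $paren - 1 ] : undef;
            },
        );

        return scalar @fields;    # true when at least one field was found
    },
);

my ($first, $second);
if ("alpha beta" =~ /ignored/) {
    ($first, $second) = ($1, $2);
}
print "first=$first second=$second\n" if defined $first && defined $second;
```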
perl internals still need to be changed to support this, but when it's
done it'll allow the binding of "%+" and "%-" and support the Tie::Hash
methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, NEXTKEY and
The only ways to untaint an existing variable in Perl are to use it as
a hash key or to reference subpatterns from a regular expression match
(see perlsec). The latter only works in perl's regex engine because it
explicitly untaints capture variables, which a custom engine will also
need to do if it wants its capture variables to be untainted.

There are basically two ways to go about this. The first and obvious
one is to make use of Perl's lexical scoping, which enables the use of
its built-in regex engine in the scope of the overriding engine's
callbacks:
    use re::engine::Plugin (
    my ($re, $str) = @_; # $str is tainted
    my ($re, $paren) = @_;

    # This is perl's engine doing the match

    # $1 has been untainted

The second is to use something like Taint::Util, which flips the taint
flag on the scalar without invoking perl's regex engine:
    use re::engine::Plugin (
    my ($re, $str) = @_; # $str is tainted
    my ($re, $paren) = @_;

    # Copy $str and untaint the copy
    untaint(my $ret = $str);

    # Return the untainted value

In either case a regex engine using perl's regex API or this module is
responsible for whether and how it untaints its variables.
perlreapi, Taint::Util
* Engines implemented with this module don't support "s///" and "split
  //", the appropriate parts of the "REGEXP" struct need to be wrapped
* Still not a complete wrapper for perlreapi in other ways, needs
  methods for some "REGEXP" struct members, some callbacks aren't
* Support overloading operations on the "qr//" object, this allows
  control over "qr//" objects in a manner that isn't limited by

      '@{}' => sub { ... },

* Support the dispatch of arbitrary methods from the re::engine::Plugin
  qr// object to user-defined subroutines via AUTOLOAD:
    package re::engine::Plugin;
    my ($name) = $AUTOLOAD =~ /.*::(.*)/;
    my $cv = getmeth($name); # or something like that

    package re::engine::SomeEngine;
    $re->add_method( # or something like that
    my ($re, @arg) = @_; # re::engine::Plugin, 1..5

    use re::engine::SomeEngine;
    $re->foshizzle(1..5);
* Implement the dupe callback, test this on a threaded perl (and learn
  how to use threads and how they break the current model).
* Allow the user to specify ->offs either as an array or a packed
  string. Can pack() even pack I32? Only IV? int?
* Add tests that check for different behavior when curpm is and is not
* Add tests that check the refcount of the stash and other things I'm
  mucking with, run valgrind and make sure everything is destroyed
* Run the debugger on the testsuite and find cases when the intuit and
  checkstr callbacks are called. Write wrappers around them and add
Please report any bugs that aren't already listed at
<http://rt.cpan.org/Dist/Display.html?Queue=re-engine-Plugin> to
<http://rt.cpan.org/Public/Bug/Report.html?Queue=re-engine-Plugin>
Ævar Arnfjörð Bjarmason "<avar at cpan.org>"

Vincent Pit "<perl at profvince.com>"

Copyright 2007-2008 Ævar Arnfjörð Bjarmason.

Copyright 2009 Vincent Pit.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.