--- /dev/null
+NAME
+ re::engine::Plugin - API to write custom regex engines
+
+DESCRIPTION
+ As of perl 5.9.5 it's possible to lexically replace perl's built-in
+ regular expression engine with your own (see perlreapi and perlpragma).
+ This module provides a glue interface to the relevant parts of the perl
+ C API enabling you to write an engine in Perl instead of the C/XS
+ interface provided by the core.
+
+ The gory details
+ Each regex in perl is compiled into an internal "REGEXP" structure (see
+ perlreapi). This can happen at compile time for literal patterns such
+ as "/pattern/", at run time for "qr//" patterns, or somewhere in
+ between, depending on variable interpolation and the like.
+
+ When this module is loaded into a scope it inserts a hook into
+ $^H{regcomp} (as described in perlreapi and perlpragma) so that each
+ regexp constructed in that lexical scope is handled by this engine. It
+ differs from other engines in that it also inserts further hooks into
+ "%^H" in the same scope. These point to the user-defined subroutines to
+ use during compilation, execution, etc., and are described in
+ "CALLBACKS" below.
+
+ The callbacks (e.g. "comp") then get called with a re::engine::Plugin
+ object as their first argument. This object provides access to perl's
+ internal REGEXP struct in addition to its own state (e.g. a stash). The
+ methods on this object allow for altering the "REGEXP" struct's internal
+ state, adding new callbacks, etc.
+
+CALLBACKS
+ Callbacks are specified in the "re::engine::Plugin" import list as
+ key-value pairs of names and subroutine references:
+
+ use re::engine::Plugin (
+ comp => sub {},
+ exec => sub {},
+ );
+
+ To write a custom engine which imports your functions into the caller's
+ scope, use the following snippet:
+
+ package re::engine::Example;
+ use re::engine::Plugin ();
+
+ sub import
+ {
+ # Sets the caller's $^H{regcomp} and fills its %^H with our callbacks
+ re::engine::Plugin->import(
+ comp => \&comp,
+ exec => \&exec,
+ );
+ }
+
+ *unimport = \&re::engine::Plugin::unimport;
+
+ # Implementation of the engine
+ sub comp { ... }
+ sub exec { ... }
+
+ 1;
+
+ comp
+ comp => sub {
+ my ($rx) = @_;
+
+ # return value discarded
+ }
+
+ Called when a regex is compiled by perl. This is always the first
+ callback to be called, and it may be called multiple times or not at
+ all, depending on what perl sees fit at the time.
+
+ The first argument will be a freshly constructed "re::engine::Plugin"
+ object (think of it as $self) which you can interact with using the
+ methods below. This object will be passed to the other callbacks and
+ methods for the lifetime of the regex.
+
+ Calling "die" or anything that uses it (such as "croak") here will not
+ be trapped by an "eval" block that the pattern is in, for example:
+
+ use Carp 'croak';
+ use re::engine::Plugin(
+ comp => sub {
+ my $rx = shift;
+ croak "Your pattern is invalid"
+ unless $rx->pattern =~ /pony/;
+ }
+ );
+
+ # Ignores the eval block
+ eval { /you die in C<eval>, you die for real/ };
+
+ This happens because the real subroutine call happens indirectly at
+ compile time and not in the scope of the "eval" block. This is how
+ perl's own engine would behave in the same situation if given an invalid
+ pattern such as "/(/".
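+
+ The compile-time versus run-time distinction can be seen with perl's
+ built-in engine alone. The following is a minimal sketch using only core
+ Perl; the variable names are illustrative and not part of this module,
+ and it makes no claim about how a plugin engine behaves. A pattern
+ interpolated from a variable is compiled when the match runs, so the
+ compilation error is raised inside the "eval" block and ends up in $@:
+
+ ```perl
+ use strict;
+ use warnings;
+
+ # An invalid pattern held in a variable is only compiled at run time.
+ my $pat = '(';
+
+ # The error is raised while the block runs, so the eval block traps it.
+ my $ok = eval { my $matched = 'foo' =~ /$pat/; 1 };
+
+ print "trapped: $@" unless $ok;
+ ```
+
+ A literal "/(/" in the same position would instead stop compilation of
+ the whole file before the "eval" block was ever entered.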
+
+ exec
+ exec => sub {
+ my ($rx, $str) = @_;
+
+ # We always like ponies!
+ return 1 if $str =~ /pony/;
+
+ # Failed to match
+ return;
+ }
+
+ Called when a regex is being executed, i.e. when it's being matched
+ against something. The scalar being matched against the pattern is
+ available as the second argument ($str) and through the str method. The
+ routine should return a true value if the match was successful, and a
+ false one if it wasn't.
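+
+ Putting "comp" and "exec" together, a complete engine might look like
+ the sketch below. It is an illustration under stated assumptions, not a
+ reference implementation: the package name re::engine::Literal is
+ hypothetical, and it assumes that "pattern" returns the bare pattern text
+ and that "stash" (see "METHODS") keeps its value between the two
+ callbacks. The engine treats every pattern as a literal substring:
+
+ ```perl
+ package re::engine::Literal;    # hypothetical name, for illustration
+ use strict;
+ use warnings;
+ use re::engine::Plugin ();
+
+ sub import {
+     re::engine::Plugin->import(
+         comp => sub {
+             my ($rx) = @_;
+
+             # Nothing to compile; remember the pattern text for exec.
+             $rx->stash({ needle => $rx->pattern });
+         },
+         exec => sub {
+             my ($rx, $str) = @_;
+
+             # True when the needle occurs anywhere in the subject.
+             return index($str, $rx->stash->{needle}) >= 0;
+         },
+     );
+ }
+
+ *unimport = \&re::engine::Plugin::unimport;
+
+ 1;
+ ```
+
+ In a scope that loads this package with "use", a match such as
+ "a+b" =~ /a+b/ would then be expected to succeed, because "+" is no
+ longer treated as a quantifier.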
+
+METHODS
+ str
+ "str" =~ /pattern/;
+ # in comp/exec/methods:
+ my $str = $rx->str;
+
+ The last scalar to be matched against the pattern or "undef" if there
+ hasn't been a match yet.
+
+ perl's own engine always stringifies the scalar being matched against a
+ given pattern, but a custom engine need not have such restrictions.
+ One could write an engine that matches a file handle, or any other
+ complex data structure, against a pattern.
+
+ pattern
+ The pattern that the engine was asked to compile. This can be either a
+ classic Perl pattern with modifiers like "/pat/ix" or "qr/pat/ix", or an
+ arbitrary scalar. The latter allows for passing anything that doesn't
+ fit in a string and five modifier characters, such as hashrefs, objects,
+ etc.
+
+ mod
+ my %mod = $rx->mod;
+ say "has /ix" if $mod{i} and $mod{x};
+
+ A key-value pair list of the modifiers the pattern was compiled with.
+ The keys will be zero or more of "imsxp" and the values will be true values
+ (so that you don't have to write "exists").
+
+ You don't get to know if the "eogc" modifiers were attached to the
+ pattern since these are internal to perl and shouldn't matter to regexp
+ engines.
+
+ stash
+ comp => sub { shift->stash( [ 1 .. 5 ] ) },
+ exec => sub { shift->stash }, # Get [ 1 .. 5 ]
+
+ Returns or sets a user defined stash that's passed around as part of the
+ $rx object, useful for passing around all sorts of data between the
+ callback routines and methods.
+
+ minlen
+ $rx->minlen($num);
+ my $minlen = $rx->minlen // "not set";
+
+ The minimum "length" a string must be to match the pattern. perl will
+ use this internally during matching to check whether the stringified
+ form of the string (or other object) being matched is at least this
+ long; if not, the regexp engine in effect (that means you!) will not be
+ called at all.
+
+ The length specified will be used as a byte length (using SvPV), not a
+ character length.
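+
+ The difference matters for non-ASCII subjects. The sketch below uses only
+ core Perl and assumes that the length reported under "use bytes" is the
+ size of the same internal buffer that SvPV exposes; the variable names
+ are illustrative:
+
+ ```perl
+ use strict;
+ use warnings;
+
+ # One character (U+263A) that perl stores as three UTF-8 bytes.
+ my $str = "\x{263A}";
+
+ my $chars = length $str;                      # character length
+ my $bytes = do { use bytes; length $str };    # internal byte length
+
+ print "chars=$chars bytes=$bytes\n";          # prints "chars=1 bytes=3"
+ ```
+
+ Under that assumption, a "minlen" of 2 would not stop the engine from
+ being called for this one-character string.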
+
+ num_captures
+ $re->num_captures(
+ FETCH => sub {
+ my ($re, $paren) = @_;
+
+ return "value";
+ },
+ STORE => sub {
+ my ($re, $paren, $rhs) = @_;
+
+ # return value discarded
+ },
+ LENGTH => sub {
+ my ($re, $paren) = @_;
+
+ return 123;
+ },
+ );
+
+ Takes a list of key-value pairs of names and subroutines that implement
+ numbered capture variables. "FETCH" will be called on value retrieval
+ ("say $1"), "STORE" on assignment ('$1 = "ook"') and "LENGTH" on "length
+ $1".
+
+ The second parameter of each routine is the paren number being
+ requested/stored. The following mapping applies for those numbers:
+
+ -2 => $` or ${^PREMATCH}
+ -1 => $' or ${^POSTMATCH}
+ 0 => $& or ${^MATCH}
+ 1 => $1
+ # ...
+
+ Assignment to capture variables makes it possible to implement something
+ like Perl 6 ":rw" semantics, and since it's possible to make the capture
+ variables return any scalar instead of just a string it becomes possible
+ to implement Perl 6 match object semantics (to name an example).
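+
+ A "FETCH" routine has to translate those paren numbers into values
+ itself. One way to do so is sketched below in plain Perl. The helper
+ name capture_value and its argument layout are hypothetical; an engine
+ would typically record the match offsets in "exec" (for instance in the
+ stash) and call something like this from "FETCH":
+
+ ```perl
+ use strict;
+ use warnings;
+
+ # Hypothetical helper: given the subject, the start/end offsets of the
+ # overall match and a paren number, return the value to hand back.
+ sub capture_value {
+     my ($str, $start, $end, $paren, @groups) = @_;
+
+     return substr($str, 0, $start)             if $paren == -2;  # $`
+     return substr($str, $end)                  if $paren == -1;  # $'
+     return substr($str, $start, $end - $start) if $paren ==  0;  # $&
+
+     return $groups[$paren - 1];    # $1, $2, ... (undef if not set)
+ }
+
+ # "pony" occupies offsets 3 to 7 of the subject string.
+ my @match = ('my pony runs', 3, 7);
+
+ print join('|', map { capture_value(@match, $_) } -2, 0, -1), "\n";
+ # prints "my |pony| runs"
+ ```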
+
+ named_captures
+ TODO: implement
+
+ perl internals still need to be changed to support this, but when that's
+ done it'll allow the binding of "%+" and "%-" and support the Tie::Hash
+ methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, NEXTKEY and
+ SCALAR.
+
+TAINTING
+ The only ways to untaint an existing variable in Perl are to use it as a
+ hash key or to reference subpatterns from a regular expression match
+ (see perlsec). The latter only works in perl's regex engine because it
+ explicitly untaints capture variables, which a custom engine will also
+ need to do if it wants its capture variables to be untainted.
+
+ There are basically two ways to go about this. The first and most
+ obvious one is to make use of Perl's lexical scoping, which enables the
+ use of its built-in regex engine in the scope of the overriding engine's
+ callbacks:
+
+ use re::engine::Plugin (
+ exec => sub {
+ my ($re, $str) = @_; # $str is tainted
+
+ $re->num_captures(
+ FETCH => sub {
+ my ($re, $paren) = @_;
+
+ # This is perl's engine doing the match
+ $str =~ /(.*)/;
+
+ # $1 has been untainted
+ return $1;
+ },
+ );
+ },
+ );
+
+ The second is to use something like Taint::Util, which flips the taint
+ flag on the scalar without invoking perl's regex engine:
+
+ use Taint::Util;
+ use re::engine::Plugin (
+ exec => sub {
+ my ($re, $str) = @_; # $str is tainted
+
+ $re->num_captures(
+ FETCH => sub {
+ my ($re, $paren) = @_;
+
+ # Copy $str and untaint the copy
+ untaint(my $ret = $str);
+
+ # Return the untainted value
+ return $ret;
+ },
+ );
+ },
+ );
+
+ In either case a regex engine using perl's regex API or this module is
+ responsible for whether and how it untaints its variables.
+
+SEE ALSO
+ perlreapi, Taint::Util
+
+TODO / CAVEATS
+ *here be dragons*
+
+ * Engines implemented with this module don't support "s///" and "split
+ //"; the appropriate parts of the "REGEXP" struct need to be wrapped
+ and documented.
+
+ * Still not a complete wrapper for perlreapi in other ways, needs
+ methods for some "REGEXP" struct members, some callbacks aren't
+ implemented etc.
+
+ * Support overloading operations on the "qr//" object. This would allow
+ control over the behavior of "qr//" objects in a manner that isn't
+ limited by "wrapped"/"wraplen".
+
+ $re->overload(
+ '""' => sub { ... },
+ '@{}' => sub { ... },
+ ...
+ );
+
+ * Support the dispatch of arbitrary methods from the re::engine::Plugin
+ qr// object to user defined subroutines via AUTOLOAD:
+
+ package re::engine::Plugin;
+ sub AUTOLOAD
+ {
+ our $AUTOLOAD;
+ my ($name) = $AUTOLOAD =~ /.*::(.*)/;
+ my $cv = getmeth($name); # or something like that
+ goto &$cv;
+ }
+
+ package re::engine::SomeEngine;
+
+ sub comp
+ {
+ my $re = shift;
+
+ $re->add_method( # or something like that
+ foshizzle => sub {
+ my ($re, @arg) = @_; # re::engine::Plugin, 1..5
+ },
+ );
+ }
+
+ package main;
+ use re::engine::SomeEngine;
+ # later:
+
+ my $re = qr//;
+ $re->foshizzle(1..5);
+
+ * Implement the dupe callback, test this on a threaded perl (and learn
+ how to use threads and how they break the current model).
+
+ * Allow the user to specify ->offs either as an array or a packed
+ string. Can pack() even pack I32? Only IV? int?
+
+ * Add tests that check for different behavior when curpm is and is not
+ set.
+
+ * Add tests that check the refcount of the stash and other things I'm
+ mucking with, run valgrind and make sure everything is destroyed
+ when it should.
+
+ * Run the debugger on the testsuite and find cases when the intuit and
+ checkstr callbacks are called. Write wrappers around them and add
+ tests.
+
+BUGS
+ Please report any bugs that aren't already listed at
+ <http://rt.cpan.org/Dist/Display.html?Queue=re-engine-Plugin> to
+ <http://rt.cpan.org/Public/Bug/Report.html?Queue=re-engine-Plugin>
+
+AUTHORS
+ Ævar Arnfjörð Bjarmason "<avar at cpan.org>"
+
+ Vincent Pit "<perl at profvince.com>"
+
+LICENSE
+ Copyright 2007-2008 Ævar Arnfjörð Bjarmason.
+
+ Copyright 2009 Vincent Pit.
+
+ This program is free software; you can redistribute it and/or modify it
+ under the same terms as Perl itself.
+