=head1 NAME
-re::engine::Plugin - Pure-Perl regular expression engine plugin interface
+re::engine::Plugin - API to write custom regex engines
-=head1 SYNOPSIS
+=head1 NOTICE
- use feature ':5.10';
- use re::engine::Plugin (
- comp => sub {
- my ($re) = @_; # A re::engine::Plugin object
+This is a B<developer release> that requires a patch to blead to work;
+the patch can be found in F<named_capture.patch> in this distribution.
- # return value ignored
- },
- exec => sub {
- my ($re, $str) = @_;
+=head1 DESCRIPTION
+
+As of perl 5.9.5 it's possible to lexically replace perl's built-in
+regular expression engine with your own (see L<perlreapi> and
+L<perlpragma>). This module provides a glue interface to the relevant
+parts of the perl C API enabling you to write an engine in Perl
+instead of the C/XS interface provided by the core.
+
+=head2 The gory details
+
+Each regex in perl is compiled into an internal C<REGEXP> structure
+(see L<perlreapi|perlreapi/The REGEXP structure>). This can happen
+either at compile time for patterns in the format C</pattern/>, at
+runtime for C<qr//> patterns, or somewhere in between depending on
+variable interpolation etc.
- # We always like ponies!
- return 1 if $str eq 'pony';
- return;
- }
+When this module is loaded into a scope it inserts a hook into
+C<$^H{regcomp}> (as described in L<perlreapi>) to have each regexp
+constructed in its lexical scope handled by this engine. It differs
+from other engines in that it also inserts other hooks into C<%^H> in
+the same scope that point to user-defined subroutines to use during
+compilation, execution etc. These are described in L</CALLBACKS>
+below.
+
+The callbacks (e.g. L</comp>) then get called with a
+L<re::engine::Plugin> object as their first argument. This object
+provides access to perl's internal REGEXP struct in addition to its own
+state (e.g. a L<stash|/stash>). The L<methods|/METHODS> on this object
+allow for altering the C<REGEXP> struct's internal state, adding new
+callbacks, etc.
+
+=head1 CALLBACKS
+
+Callbacks are specified in the C<re::engine::Plugin> import list as
+key-value pairs of names and subroutine references:
+
+ use re::engine::Plugin (
+ comp => sub {},
+ exec => sub {},
);
- "pony" =~ /yummie/;
+To write a custom engine which imports your functions into the
+caller's scope use the following snippet:
-=head1 DESCRIPTION
+ package re::engine::Example;
+ use re::engine::Plugin ();
+
+ sub import
+ {
+ # Populates the caller's %^H with our callbacks
+ re::engine::Plugin->import(
+ comp => \&comp,
+ exec => \&exec,
+ );
+ }
+
+ *unimport = \&re::engine::Plugin::unimport;
+
+ # Implementation of the engine
+ sub comp { ... }
+ sub exec { ... }
+
+ 1;
+
+=head2 comp
+
+ comp => sub {
+ my ($rx) = @_;
+
+ # return value discarded
+ }
+
+Called when a regex is compiled by perl. This is always the first
+callback to be called, and it may be called multiple times or not at
+all depending on what perl sees fit at the time.
+
+The first argument will be a freshly constructed C<re::engine::Plugin>
+object (think of it as C<$self>) which you can interact with using the
+L<methods|/METHODS> below. This object will be passed around the other
+L<callbacks|/CALLBACKS> and L<methods|/METHODS> for the lifetime of
+the regex.
+
+Calling C<die> or anything that uses it (such as C<croak>) here will
+not be trapped by an C<eval> block that the pattern is in, i.e.
+
+ use Carp 'croak';
+ use re::engine::Plugin(
+ comp => sub {
+ my $rx = shift;
+ croak "Your pattern is invalid"
+ unless $rx->pattern ~~ /pony/;
+ }
+ );
+
+ # Ignores the eval block
+ eval { /you die in C<eval>, you die for real/ };
+
+This happens because the real subroutine call happens indirectly at
+compile time and not in the scope of the C<eval> block. This is how
+perl's own engine would behave in the same situation if given an
+invalid pattern such as C</(/>.
+
+=head2 exec
-As of perl 5.9.5 it's possible lexically replace perl's built-in
-regular expression engine (see L<perlreguts|perlreguts/"Pluggable
-Interface">). This module provides glue for writing such a wrapper in
-Perl instead of the provided C/XS interface.
+ exec => sub {
+ my ($rx, $str) = @_;
-B<NOTE>: This module is a development release that does not work with
-any version of perl other than the current (as of February 2007)
-I<blead>. The provided interface is not a complete wrapper around the
-native interface (yet!) but the parts that are left can be implemented
-with additional methods so the completed API shouldn't have any major
-changes.
+ # We always like ponies!
+ return 1 if $str ~~ /pony/;
+
+ # Failed to match
+ return;
+ }
+
+Called when a regex is being executed, i.e. when it's being matched
+against something. The scalar being matched against the pattern is
+available as the second argument (C<$str>) and through the L<str|/str>
+method. The routine should return a true value if the match was
+successful, and a false one if it wasn't.
=head1 METHODS
-=head2 import
+=head2 str
-Takes a list of key-value pairs with the only mandatory pair being
-L</exec> and its callback routine. Both subroutine references and the
-string name of a subroutine (e.g. C<"main::exec">) can be
-specified. The real CODE ref is currently looked up in the symbol
-table in the latter case.
+ "str" ~~ /pattern/;
+ # in comp/exec/methods:
+ my $str = $rx->str;
-=over 4
+The last scalar to be matched against the L<pattern|/pattern> or
+C<undef> if there hasn't been a match yet.
-=item comp
+perl's own engine always stringifies the scalar being matched against
+a given pattern, however a custom engine need not have such
+restrictions. One could write an engine that matched a file handle
+against a pattern or any other complex data structure.
-An optional sub to be called when a pattern is being compiled, note
-that a single pattern may be compiled more than once by perl.
+=head2 pattern
-The subroutine will be called with a regexp object (see L</Regexp
-object>). The regexp object will be stored internally along with the
-pattern and provided as the first argument for the other callback
-routines (think of it as C<$self>).
+The pattern that the engine was asked to compile. This can be either a
+classic Perl pattern with modifiers like C</pat/ix> or C<qr/pat/ix> or
+an arbitrary scalar. The latter allows for passing anything that
+doesn't fit in a string and five L<modifier|/mod> characters, such as
+hashrefs, objects, etc.
-If your regex implementation needs to validate its pattern this is the
-right place to B<croak> on an invalid one (but see L</BUGS>).
+=head2 mod
-The return value of this subroutine is discarded.
+ my %mod = $rx->mod;
+ say "has /ix" if %mod ~~ 'i' and %mod ~~ 'x';
-=item exec
+A key-value pair list of the modifiers the pattern was compiled with.
+The keys will be zero or more of C<imsxp> and the values will be true
+values (so that you don't have to write C<exists>).
-Called when a given pattern is being executed, the first argument is
-the regexp object and the second is the string being matched. The
-routine should return true if the pattern matched and false if it
-didn't.
+You don't get to know if the C<eogc> modifiers were attached to the
+pattern since these are internal to perl and shouldn't matter to
+regexp engines.
-=item intuit
+=head2 stash
-TODO: implement
+ comp => sub { shift->stash( [ 1 .. 5 ] ) },
+ exec => sub { shift->stash }, # Get [ 1 .. 5 ]
-=item checkstr
+Returns or sets a user defined stash that's passed around as part of
+the C<$rx> object, useful for passing around all sorts of data between
+the callback routines and methods.
-TODO: implement
+=head2 minlen
-=item free
+ $rx->minlen($num);
+ my $minlen = $rx->minlen // "not set";
-TODO: implement
+The minimum C<length> a string must be to match the pattern. Perl will
+use this internally during matching to check whether the stringified
+form of the string (or other object) being matched is at least this
+long; if not, the regexp engine in effect (that means you!) will not
+be called at all.
-=item dupe
+The length specified will be used as a byte length (using
+L<SvPV|perlapi/SvPV>), not a character length.
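+
+As a rough illustration of why the distinction matters (plain Perl,
+independent of this module; the sample string is made up), a string
+holding non-ASCII characters occupies more bytes than it has
+characters once it is encoded as UTF-8, which is how perl stores such
+strings internally:
+
+    use strict;
+    use warnings;
+
+    my $str   = "\x{263A}pony";         # 5 characters
+    my $chars = length $str;
+
+    utf8::encode(my $octets = $str);    # UTF-8 encoded copy
+    my $bytes = length $octets;         # 3 bytes for U+263A plus 4
+
+    print "$chars characters, $bytes bytes\n";
+
+A C<minlen> chosen by counting characters may therefore be smaller
+than the byte length perl compares it against.
+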
-TODO: implement
+=head2 num_captures
-=item numbered_buff_get
+ $re->num_captures(
+ FETCH => sub {
+ my ($re, $paren) = @_;
-TODO: implement
+ return "value";
+ },
+ STORE => sub {
+ my ($re, $paren, $rhs) = @_;
-=item named_buff_get
+ # return value discarded
+ },
+ LENGTH => sub {
+ my ($re, $paren) = @_;
-TODO: implement
+ return 123;
+ },
+ );
-=back
+Takes a list of key-value pairs of names and subroutines that
+implement numbered capture variables. C<FETCH> will be called on value
+retrieval (C<say $1>), C<STORE> on assignment (C<$1 = "ook">) and
+C<LENGTH> on C<length $1>.
-=head2 flags
+The second parameter of each routine is the paren number being
+requested/stored. The following mapping applies for those numbers:
-L<perlop/"/PATTERN/cgimosx">
+ -2 => $` or ${^PREMATCH}
+ -1 => $' or ${^POSTMATCH}
+ 0 => $& or ${^MATCH}
+ 1 => $1
+ # ...
-=head1 TODO
+Assignment to capture variables makes it possible to implement
+something like Perl 6 C<:rw> semantics, and since it's possible to
+make the capture variables return any scalar instead of just a string
+it becomes possible to implement Perl 6 match object semantics (to
+name an example).
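+
+A C<FETCH> routine that follows the paren number mapping above might
+look roughly like this sketch. The C<%match> hash and its keys are
+hypothetical stand-ins for whatever your C<exec> callback records;
+they are not part of this module.
+
+    # Hypothetical data an exec callback could have recorded
+    # for /(pony)/ matched against "a pony ride"
+    my %match = (
+        pre    => "a ",
+        full   => "pony",
+        post   => " ride",
+        groups => [ "pony" ],
+    );
+
+    sub fetch {
+        my ($re, $paren) = @_;
+
+        return $match{pre}  if $paren == -2;   # $`
+        return $match{post} if $paren == -1;   # $'
+        return $match{full} if $paren ==  0;   # $&
+        return $match{groups}[ $paren - 1 ];   # $1, $2, ...
+    }
+
+Such a routine would be installed with
+C<< $re->num_captures(FETCH => \&fetch) >>.
+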
-=over
+=head2 named_captures
-=item *
+B<TODO>: document
-Provide an API for named (C<$+{name}>) and unnamed (C<$1, $2, ...>)
-match variables, allow specifying both offsets into the pattern and
-any given scalar.
+This is implemented but not documented, see F<t/named_buff> for usage
+examples.
-=item *
+=head1 Tainting
-Find some neat example for the L</SYNOPSIS>, suggestions welcome.
+The only way to untaint an existing variable in Perl is to use it as a
+hash key or to reference subpatterns from a regular expression match
+(see L<perlsec|perlsec/Laundering and Detecting Tainted Data>). The
+latter only works in perl's regex engine because it explicitly
+untaints capture variables, which a custom engine will also need to do
+if it wants its capture variables to be untainted.
-=back
+There are basically two ways to go about this. The first and obvious
+one is to make use of Perl's lexical scoping, which enables the use of
+its built-in regex engine in the scope of the overriding engine's
+callbacks:
-=head1 BUGS
+ use re::engine::Plugin (
+ exec => sub {
+ my ($re, $str) = @_; # $str is tainted
-Please report any bugs that aren't already listed at
-L<http://rt.cpan.org/Dist/Display.html?Queue=re-engine-Plugin> to
-L<http://rt.cpan.org/Public/Bug/Report.html?Queue=re-engine-Plugin>
+ $re->num_captures(
+ FETCH => sub {
+ my ($re, $paren) = @_;
-=over 1
+ # This is perl's engine doing the match
+ $str ~~ /(.*)/;
-=item
+ # $1 has been untainted
+ return $1;
+ },
+ );
+ },
+ );
-Calling C<die> or anything that uses it (such as C<carp>) in the
-L</comp> callback routines will not be trapped by an C<eval> block
-that the pattern is in, i.e.
+The second is to use something like L<Taint::Util> which flips the
+taint flag on the scalar without invoking perl's regex engine:
- use Carp qw(croak);
- use re::engine::Plugin(
- comp => sub {
- my $re = shift;
- croak "Your pattern is invalid"
- unless $re->pattern =~ /pony/;
- }
- );
+ use Taint::Util;
+ use re::engine::Plugin (
+ exec => sub {
+ my ($re, $str) = @_; # $str is tainted
- # Ignores the eval block
- eval { /you die in C<eval>, you die for real/ };
+ $re->num_captures(
+ FETCH => sub {
+ my ($re, $paren) = @_;
-Simply put this happens because the real subroutine call happens
-indirectly and not in the scope of the C<eval> block.
+ # Copy $str and untaint the copy
+ untaint(my $ret = $str);
-=back
+ # Return the untainted value
+ return $ret;
+ },
+ );
+ },
+ );
+
+In either case a regex engine using perl's L<regex api|perlreapi> or
+this module is responsible for whether and how it untaints its
+variables.
+
+=head1 SEE ALSO
-=head1 Regexp object
+L<perlreapi>, L<Taint::Util>
-The regexp object is passed around as the first argument to all the
-callback routines, it supports the following method calls (with more
-to come!).
+=head1 TODO / CAVEATS
+
+I<here be dragons>
=over
-=item pattern
+=item *
-Returns the pattern this regexp was compiled with.
+Export constants defined as macros in core relevant to our interests,
+e.g. PMf_ stuff and things needed by extflags.
-=item flags
+=item *
-Returns a string of flags the pattern was compiled
-with. (e.g. C<"xs">). The flags are not guarenteed to be in any
-particular order, so don't depend on the current one.
+Engines implemented with this module don't support C<s///> and C<split
+//>; the appropriate parts of the C<REGEXP> struct need to be wrapped
+and documented.
-=item stash
+=item *
-Returns or sets a user-defined stash that's passed around with the
-pattern, this is useful for passing around an arbitary scalar between
-callback routines, example:
+Still not a complete wrapper for L<perlreapi> in other ways, needs
+methods for some C<REGEXP> struct members, some callbacks aren't
+implemented etc.
- use re::engine::Plugin (
- comp => sub { $_[0]->stash( [ 1 .. 5 ] ) },
- comp => sub { $_[0]->stash }, # Get [ 1 .. 5]
+=item *
+
+Support overloading operations on the C<qr//> object. This would allow
+control over the stringification of C<qr//> objects in a manner that
+isn't limited by C<wrapped>/C<wraplen>.
+
+ $re->overload(
+ '""' => sub { ... },
+ '@{}' => sub { ... },
+ ...
);
-=item minlen
+=item *
-The minimum length a given string must be to match the pattern, set
-this to an integer in B<comp> and perl will not call your B<exec>
-routine unless the string being matched as at least that long. Returns
-the currently set length if not called with any arguments or C<undef>
-if no length has been set.
+Support the dispatch of arbitrary methods from the re::engine::Plugin
+qr// object to user defined subroutines via AUTOLOAD:
-=back
+ package re::engine::Plugin;
+ sub AUTOLOAD
+ {
+ our $AUTOLOAD;
+ my ($name) = $AUTOLOAD =~ /.*::(.*)/;
+ my $cv = getmeth($name); # or something like that
+ goto &$cv;
+ }
-=head1 SEE ALSO
+ package re::engine::SomeEngine;
+
+ sub comp
+ {
+ my $re = shift;
-L<perlreguts/Pluggable Interface>
+ $re->add_method( # or something like that
+ foshizzle => sub {
+ my ($re, @arg) = @_; # re::engine::Plugin, 1..5
+ },
+ );
+ }
-=head1 THANKS
+ package main;
+ use re::engine::SomeEngine;
+ later:
-Yves explaining why I made the regexp engine a sad panda.
+ my $re = qr//;
+ $re->foshizzle(1..5);
+
+=item *
+
+Implement the dupe callback, test this on a threaded perl (and learn
+how to use threads and how they break the current model).
+
+=item *
+
+Allow the user to specify ->offs either as an array or a packed
+string. Can pack() even pack I32? Only IV? int?
+
+=item *
+
+Add tests that check for different behavior when curpm is and is not
+set.
+
+=item *
+
+Add tests that check the refcount of the stash and other things I'm
+mucking with, run valgrind and make sure everything is destroyed when
+it should.
+
+=item *
+
+Run the debugger on the testsuite and find cases when the intuit and
+checkstr callbacks are called. Write wrappers around them and add
+tests.
+
+=back
+
+=head1 BUGS
+
+Please report any bugs that aren't already listed at
+L<http://rt.cpan.org/Dist/Display.html?Queue=re-engine-Plugin> to
+L<http://rt.cpan.org/Public/Bug/Report.html?Queue=re-engine-Plugin>
=head1 AUTHOR
=head1 LICENSE
+Copyright 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.
+
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
-Copyright 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.
-
=cut