X-Git-Url: http://git.vpit.fr/?p=perl%2Fmodules%2Fre-engine-Plugin.git;a=blobdiff_plain;f=Plugin.pod;h=d4bb1a246bed672d35878a3dc84e4dd310b9d49d;hp=a076caadbae2588085a5fe71cad12f25c37c1815;hb=2dd7bc5f80da4fe2220e28de1102641c239d084c;hpb=def98fc0d7f5e9527b28af6b90d4ddb07fbc845c diff --git a/Plugin.pod b/Plugin.pod index a076caa..d4bb1a2 100644 --- a/Plugin.pod +++ b/Plugin.pod @@ -1,198 +1,395 @@ =head1 NAME -re::engine::Plugin - Pure-Perl regular expression engine plugin interface +re::engine::Plugin - API to write custom regex engines -=head1 SYNOPSIS +=head1 DESCRIPTION - use feature ':5.10'; - use re::engine::Plugin ( - comp => sub { - my ($re) = @_; # A re::engine::Plugin object +As of perl 5.9.5 it's possible to lexically replace perl's built-in +regular expression engine with your own (see L and +L). This module provides a glue interface to the relevant +parts of the perl C API enabling you to write an engine in Perl +instead of the C/XS interface provided by the core. - # return value ignored - }, - exec => sub { - my ($re, $str) = @_; +=head2 The gory details + +Each regex in perl is compiled into an internal C structure +(see L). This can happen +either during compile time in the case of patterns in the format +C or runtime for C patterns, or something in between +depending on variable interpolation etc. + +When this module is loaded into a scope it inserts a hook into +C<$^H{regcomp}> (as described in L) to have each regexp +constructed in its lexical scope handled by this engine, but it +differs from other engines in that it also inserts other hooks into +C<%^H> in the same scope that point to user-defined subroutines to use +during compilation, execution etc.; these are described in +L below. + +The callbacks (e.g. L) then get called with a +L object as their first argument. This object +provides access to perl's internal REGEXP struct in addition to its own +state (e.g. a L).
The L on this object +allow for altering the C struct's internal state, adding new +callbacks, etc. - # We always like ponies! - return 1 if $str eq 'pony'; - return; - } +=head1 CALLBACKS + +Callbacks are specified in the C import list as +key-value pairs of names and subroutine references: + + use re::engine::Plugin ( + comp => sub {}, + exec => sub {}, ); - "pony" =~ /yummie/; +To write a custom engine which imports your functions into the +caller's scope, use the following snippet: -=head1 DESCRIPTION + package re::engine::Example; + use re::engine::Plugin (); + + sub import + { + # Populates the caller's %^H with our callbacks + re::engine::Plugin->import( + comp => \&comp, + exec => \&exec, + ); + } + + *unimport = \&re::engine::Plugin::unimport; -As of perl 5.9.5 it's possible lexically replace perl's built-in -regular expression engine (see L). This module provides glue for writing such a wrapper in -Perl instead of the provided C/XS interface. + # Implementation of the engine + sub comp { ... } + sub exec { ... } -B: This module is a development release that does not work with -any version of perl other than the current (as of February 2007) -I. The provided interface is not a complete wrapper around the -native interface (yet!) but the parts that are left can be implemented -with additional methods so the completed API shouldn't have any major -changes. + 1; + +=head2 comp + + comp => sub { + my ($rx) = @_; + + # return value discarded + } + +Called when a regex is compiled by perl; this is always the first +callback to be called and may be called multiple times or not at all +depending on what perl sees fit at the time. + +The first argument will be a freshly constructed C +object (think of it as C<$self>) which you can interact with using the +L below; this object will be passed around the other +L and L for the lifetime of +the regex.
+ +Calling C<die> or anything that uses it (such as C<croak>) here will +not be trapped by an C<eval> block that the pattern is in, i.e. + + use Carp 'croak'; + use re::engine::Plugin( + comp => sub { + my $rx = shift; + croak "Your pattern is invalid" + unless $rx->pattern ~~ /pony/; + } + ); + + # Ignores the eval block + eval { /you die in C, you die for real/ }; + +This happens because the real subroutine call happens indirectly at +compile time and not in the scope of the C<eval> block. This is how +perl's own engine would behave in the same situation if given an +invalid pattern such as C. + +=head2 exec + + exec => sub { + my ($rx, $str) = @_; + + # We always like ponies! + return 1 if $str ~~ /pony/; + + # Failed to match + return; + } + +Called when a regex is being executed, i.e. when it's being matched +against something. The scalar being matched against the pattern is +available as the second argument (C<$str>) and through the L +method. The routine should return a true value if the match was +successful, and a false one if it wasn't. =head1 METHODS -=head2 import +=head2 str -Takes a list of key-value pairs with the only mandatory pair being -L and its callback routine. Both subroutine references and the -string name of a subroutine (e.g. C<"main::exec">) can be -specified. The real CODE ref is currently looked up in the symbol -table in the latter case. + "str" ~~ /pattern/; + # in comp/exec/methods: + my $str = $rx->str; -=over 4 +The last scalar to be matched against the L or +C if there hasn't been a match yet. -=item comp +perl's own engine always stringifies the scalar being matched against +a given pattern; however, a custom engine need not have such +restrictions. One could write an engine that matched a file handle +against a pattern or any other complex data structure. -An optional sub to be called when a pattern is being compiled, note -that a single pattern may be compiled more than once by perl.
+=head2 pattern -The subroutine will be called with a regexp object (see L). The regexp object will be stored internally along with the -pattern and provided as the first argument for the other callback -routines (think of it as C<$self>). +The pattern that the engine was asked to compile; this can be either a +classic Perl pattern with modifiers like C or C or +an arbitrary scalar. The latter allows for passing anything that +doesn't fit in a string and five L characters, such as +hashrefs, objects, etc. -If your regex implementation needs to validate its pattern this is the -right place to B on an invalid one (but see L). +=head2 mod -The return value of this subroutine is discarded. + my %mod = $rx->mod; + say "has /ix" if $mod{i} and $mod{x}; -=item exec +A key-value pair list of the modifiers the pattern was compiled with. +The keys will be zero or more of C and the values will be true +values (so that you don't have to write C). -Called when a given pattern is being executed, the first argument is -the regexp object and the second is the string being matched. The -routine should return true if the pattern matched and false if it -didn't. +You don't get to know if the C modifiers were attached to the +pattern since these are internal to perl and shouldn't matter to +regexp engines. -=item intuit +=head2 stash -TODO: implement + comp => sub { shift->stash( [ 1 .. 5 ] ) }, + exec => sub { shift->stash }, # Get [ 1 .. 5 ] -=item checkstr +Returns or sets a user-defined stash that's passed around as part of +the C<$rx> object, useful for passing around all sorts of data between +the callback routines and methods.
-TODO: implement +=head2 minlen -=item free + $rx->minlen($num); + my $minlen = $rx->minlen // "not set"; -TODO: implement +The minimum C a string must be to match the pattern; perl will +use this internally during matching to check whether the stringified +form of the string (or other object) being matched is at least this +long; if not, the regexp engine in effect (that means you!) will not be +called at all. -=item dupe +The length specified will be used as a byte length (using +L), not a character length. -TODO: implement +=head2 num_captures -=item numbered_buff_get + $re->num_captures( + FETCH => sub { + my ($re, $paren) = @_; -TODO: implement + return "value"; + }, + STORE => sub { + my ($re, $paren, $rhs) = @_; -=item named_buff_get + # return value discarded + }, + LENGTH => sub { + my ($re, $paren) = @_; -TODO: implement + return 123; + }, + ); -=back +Takes a list of key-value pairs of names and subroutines that +implement numbered capture variables. C will be called on value +retrieval (C), C on assignment (C<$1 = "ook">) and +C on C. -=head2 flags +The second parameter of each routine is the paren number being +requested/stored; the following mapping applies for those numbers: -L + -2 => $` or ${^PREMATCH} + -1 => $' or ${^POSTMATCH} + 0 => $& or ${^MATCH} + 1 => $1 + # ... -=head1 TODO +Assignment to capture variables makes it possible to implement +something like Perl 6 C<:rw> semantics, and since it's possible to +make the capture variables return any scalar instead of just a string +it becomes possible to implement Perl 6 match object semantics (to +name an example). -=over +=head2 named_captures -=item * +B: implement -Provide an API for named (C<$+{name}>) and unnamed (C<$1, $2, ...>) -match variables, allow specifying both offsets into the pattern and -any given scalar.
+perl's internals still need to be changed to support this but when it's +done it'll allow the binding of C<%+> and C<%-> and support the +L methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, +NEXTKEY and SCALAR. -=item * +=head1 Tainting -Find some neat example for the L, suggestions welcome. +The only way to untaint an existing variable in Perl is to use it as a +hash key or to reference subpatterns from a regular expression match +(see L). The +latter only works in perl's regex engine because it explicitly +untaints capture variables, which a custom engine will also need to do +if it wants its capture variables to be untainted. -=back +There are basically two ways to go about this. The first and obvious +one is to make use of Perl's lexical scoping which enables the use of +its built-in regex engine in the scope of the overriding engine's +callbacks: -=head1 BUGS + use re::engine::Plugin ( + exec => sub { + my ($re, $str) = @_; # $str is tainted -Please report any bugs that aren't already listed at -L to -L + $re->num_captures( + FETCH => sub { + my ($re, $paren) = @_; + + # This is perl's engine doing the match + $str ~~ /(.*)/; + + # $1 has been untainted + return $1; + }, + ); + }, + ); -=over 1 +The second is to use something like L<Taint::Util> which flips the +taint flag on the scalar without invoking perl's regex engine: -=item + use Taint::Util; + use re::engine::Plugin ( + exec => sub { + my ($re, $str) = @_; # $str is tainted -Calling C or anything that uses it (such as C) in the -L callback routines will not be trapped by an C block -that the pattern is in, i.e.
+ $re->num_captures( + FETCH => sub { + my ($re, $paren) = @_; - use Carp qw(croak); - use re::engine::Plugin( - comp => sub { - my $re = shift; - croak "Your pattern is invalid" - unless $re->pattern =~ /pony/; - } - ); + # Copy $str and untaint the copy + untaint(my $ret = $str); - # Ignores the eval block - eval { /you die in C, you die for real/ }; + # Return the untainted value + return $ret; + }, + ); + }, + ); -Simply put this happens because the real subroutine call happens -indirectly and not in the scope of the C block. +In either case a regex engine using perl's L or +this module is responsible for how and if it untaints its variables. -=back +=head1 SEE ALSO -=head1 Regexp object +L, L -The regexp object is passed around as the first argument to all the -callback routines, it supports the following method calls (with more -to come!). +=head1 TODO / CAVEATS + +I =over -=item pattern +=item * -Returns the pattern this regexp was compiled with. +Engines implemented with this module don't support C and C, the appropriate parts of the C struct need to be wrapped +and documented. -=item flags +=item * -Returns a string of flags the pattern was compiled -with. (e.g. C<"xs">). The flags are not guarenteed to be in any -particular order, so don't depend on the current one. +Still not a complete wrapper for L in other ways, needs +methods for some C struct members, some callbacks aren't +implemented etc. -=item stash +=item * -Returns or sets a user-defined stash that's passed around with the -pattern, this is useful for passing around an arbitary scalar between -callback routines, example: +Support overloading operations on the C object, this allow +control over the of C objects in a manner that isn't limited by +C/C. - use re::engine::Plugin ( - comp => sub { $_[0]->stash( [ 1 .. 5 ] ) }, - comp => sub { $_[0]->stash }, # Get [ 1 .. 5] + $re->overload( + '""' => sub { ... }, + '@{}' => sub { ... }, + ... 
); -=item minlen +=item * -The minimum length a given string must be to match the pattern, set -this to an integer in B and perl will not call your B -routine unless the string being matched as at least that long. Returns -the currently set length if not called with any arguments or C -if no length has been set. +Support the dispatch of arbitrary methods from the re::engine::Plugin +qr// object to user-defined subroutines via AUTOLOAD: -=back + package re::engine::Plugin; + sub AUTOLOAD + { + our $AUTOLOAD; + my ($name) = $AUTOLOAD =~ /.*::(.*)/; + my $cv = getmeth($name); # or something like that + goto &$cv; + } -=head1 SEE ALSO + package re::engine::SomeEngine; -L + sub comp + { + my $re = shift; -=head1 THANKS + $re->add_method( # or something like that + foshizzle => sub { + my ($re, @arg) = @_; # re::engine::Plugin, 1..5 + }, + ); + } -Yves explaining why I made the regexp engine a sad panda. + package main; + use re::engine::SomeEngine; + later: + + my $re = qr//; + $re->foshizzle(1..5); + +=item * + +Implement the dupe callback, test this on a threaded perl (and learn +how to use threads and how they break the current model). + +=item * + +Allow the user to specify ->offs either as an array or a packed +string. Can pack() even pack I32? Only IV? int? + +=item * + +Add tests that check for different behavior when curpm is and is not +set. + +=item * + +Add tests that check the refcount of the stash and other things I'm +mucking with, run valgrind and make sure everything is destroyed when +it should. + +=item * + +Run the debugger on the test suite and find cases when the intuit and +checkstr callbacks are called. Write wrappers around them and add +tests. + +=back + +=head1 BUGS + +Please report any bugs that aren't already listed at +L to +L =head1 AUTHOR @@ -200,9 +397,9 @@ Ævar Arnfjörð Bjarmason =head1 LICENSE +Copyright 2007 Ævar Arnfjörð Bjarmason.
+ +This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. -Copyright 2007 Ævar Arnfjörð Bjarmason. - =cut