X-Git-Url: http://git.vpit.fr/?p=perl%2Fmodules%2Fre-engine-Plugin.git;a=blobdiff_plain;f=Plugin.pod;h=98633c4fadebf42436e1c8f1643fbb46da846336;hp=a076caadbae2588085a5fe71cad12f25c37c1815;hb=HEAD;hpb=def98fc0d7f5e9527b28af6b90d4ddb07fbc845c

diff --git a/Plugin.pod b/Plugin.pod
index a076caa..98633c4 100644
--- a/Plugin.pod
+++ b/Plugin.pod
@@ -1,208 +1,496 @@
 =head1 NAME
 
-re::engine::Plugin - Pure-Perl regular expression engine plugin interface
+re::engine::Plugin - API to write custom regex engines
 
-=head1 SYNOPSIS
+=head1 VERSION
+
+Version 0.12
+
+=head1 DESCRIPTION
+
+As of perl 5.9.5 it's possible to lexically replace perl's built-in
+regular expression engine with your own (see L and
+L). This module provides a glue interface to the relevant
+parts of the perl C API enabling you to write an engine in Perl
+instead of the C/XS interface provided by the core.
+
+=head2 The gory details
+
+Each regex in perl is compiled into an internal C structure
+(see L), this can happen
+either during compile time in the case of patterns in the format
+C or runtime for C patterns, or something in between
+depending on variable interpolation etc.
+
+When this module is loaded into a scope it inserts a hook into
+C<$^H{regcomp}> (as described in L and L) to
+have each regexp constructed in its lexical scope handled by this
+engine, but it differs from other engines in that it also inserts
+other hooks into C<%^H> in the same scope that point to user-defined
+subroutines to use during compilation, execution etc, these are
+described in L below.
+
+The callbacks (e.g. L) then get called with a
+L object as their first argument. This object
+provides access to perl's internal REGEXP struct in addition to its own
+state (e.g. a L). The L on this object
+allow for altering the C struct's internal state, adding new
+callbacks, etc.
+
+=head1 CALLBACKS
+
+Callbacks are specified in the C import list as
+key-value pairs of names and subroutine references:
 
-    use feature ':5.10';
     use re::engine::Plugin (
-        comp => sub {
-            my ($re) = @_; # A re::engine::Plugin object
+        comp => sub {},
+        exec => sub {},
+        free => sub {},
+    );
 
-            # return value ignored
-        },
+To write a custom engine which imports your functions into the
+caller's scope use the following snippet:
+
+    package re::engine::Example;
+    use re::engine::Plugin ();
+
+    sub import
+    {
+        # Sets the caller's $^H{regcomp} and fills their %^H with our callbacks
+        re::engine::Plugin->import(
+            comp => \&comp,
+            exec => \&exec,
+            free => \&free,
+        );
+    }
+
+    *unimport = \&re::engine::Plugin::unimport;
+
+    # Implementation of the engine
+    sub comp { ... }
+    sub exec { ... }
+    sub free { ... }
+
+    1;
+
+=head2 comp
+
+    comp => sub {
+        my ($rx) = @_;
+
+        # return value discarded
+    }
+
+Called when a regex is compiled by perl, this is always the first
+callback to be called and may be called multiple times or not at all
+depending on what perl sees fit at the time.
+
+The first argument will be a freshly constructed C
+object (think of it as C<$self>) which you can interact with using the
+L below, this object will be passed around the other
+L and L for the lifetime of
+the regex.
+
+Calling C or anything that uses it (such as C) here will
+not be trapped by an C block that the pattern is in, i.e.
+
+    use Carp 'croak';
+    use re::engine::Plugin(
+        comp => sub {
+            my $rx = shift;
+            croak "Your pattern is invalid"
+                unless $rx->pattern =~ /pony/;
+        }
+    );
+
+    # Ignores the eval block
+    eval { /you die in C, you die for real/ };
+
+This happens because the real subroutine call happens indirectly at
+compile time and not in the scope of the C block. This is how
+perl's own engine would behave in the same situation if given an
+invalid pattern such as C.
+
+=head2 exec
+
+    my $ponies;
+    use re::engine::Plugin(
         exec => sub {
-            my ($re, $str) = @_;
+            my ($rx, $str) = @_;
+
+            # We always like ponies!
+            if ($str =~ /pony/) {
+                $ponies++;
+                return 1;
+            }
 
-            # We always like ponies!
-            return 1 if $str eq 'pony';
-            return;
+            # Failed to match
+            return;
         }
     );
 
-    "pony" =~ /yummie/;
+Called when a regex is being executed, i.e. when it's being matched
+against something. The scalar being matched against the pattern is
+available as the second argument (C<$str>) and through the L
+method. The routine should return a true value if the match was
+successful, and a false one if it wasn't.
 
-=head1 DESCRIPTION
+This callback can also be specified on an individual basis with the
+L method.
+
+=head2 free
+
+    use re::engine::Plugin(
+        free => sub {
+            my ($rx) = @_;
+
+            say 'matched ' . ($ponies // 'no')
+                . ' pon' . ($ponies > 1 ? 'ies' : 'y');
 
-As of perl 5.9.5 it's possible lexically replace perl's built-in
-regular expression engine (see L). This module provides glue for writing such a wrapper in
-Perl instead of the provided C/XS interface.
+            return;
+        }
+    );
 
-B: This module is a development release that does not work with
-any version of perl other than the current (as of February 2007)
-I. The provided interface is not a complete wrapper around the
-native interface (yet!) but the parts that are left can be implemented
-with additional methods so the completed API shouldn't have any major
-changes.
+Called when the regexp structure is freed by the perl interpreter.
+Note that this happens pretty late in the destruction process, but
+still before global destruction kicks in. The only argument this
+callback receives is the C object associated
+with the regexp, and its return value is ignored.
+
+This callback can also be specified on an individual basis with the
+L method.
 
 =head1 METHODS
 
-=head2 import
+=head2 str
 
-Takes a list of key-value pairs with the only mandatory pair being
-L and its callback routine.
-Both subroutine references and the
-string name of a subroutine (e.g. C<"main::exec">) can be
-specified. The real CODE ref is currently looked up in the symbol
-table in the latter case.
+    "str" =~ /pattern/;
+    # in comp/exec/methods:
+    my $str = $rx->str;
 
-=over 4
+The last scalar to be matched against the L or
+C if there hasn't been a match yet.
 
-=item comp
+perl's own engine always stringifies the scalar being matched against
+a given pattern, however a custom engine need not have such
+restrictions. One could write an engine that matched a file handle
+against a pattern or any other complex data structure.
 
-An optional sub to be called when a pattern is being compiled, note
-that a single pattern may be compiled more than once by perl.
+=head2 pattern
 
-The subroutine will be called with a regexp object (see L). The regexp object will be stored internally along with the
-pattern and provided as the first argument for the other callback
-routines (think of it as C<$self>).
+The pattern that the engine was asked to compile, this can be either a
+classic Perl pattern with modifiers like C or C or
+an arbitrary scalar. The latter allows for passing anything that
+doesn't fit in a string and five L characters, such as
+hashrefs, objects, etc.
 
-If your regex implementation needs to validate its pattern this is the
-right place to B on an invalid one (but see L).
+=head2 mod
 
-The return value of this subroutine is discarded.
+    my %mod = $rx->mod;
+    say "has /ix" if $mod{i} and $mod{x};
 
-=item exec
+A key-value pair list of the modifiers the pattern was compiled with.
+The keys will be zero or more of C and the values will be true
+values (so that you don't have to write C).
 
-Called when a given pattern is being executed, the first argument is
-the regexp object and the second is the string being matched. The
-routine should return true if the pattern matched and false if it
-didn't.
+You don't get to know if the C modifiers were attached to the
+pattern since these are internal to perl and shouldn't matter to
+regexp engines.
 
-=item intuit
+=head2 stash
 
-TODO: implement
+    comp => sub { shift->stash( [ 1 .. 5 ] ) },
+    exec => sub { shift->stash }, # Get [ 1 .. 5 ]
 
-=item checkstr
+Returns or sets a user defined stash that's passed around as part of
+the C<$rx> object, useful for passing around all sorts of data between
+the callback routines and methods.
 
-TODO: implement
+=head2 minlen
 
-=item free
+    $rx->minlen($num);
+    my $minlen = $rx->minlen // "not set";
 
-TODO: implement
+The minimum C a string must be to match the pattern, perl will
+use this internally during matching to check whether the stringified
+form of the string (or other object) being matched is at least this
+long, if not the regexp engine in effect (that means you!) will not be
+called at all.
 
-=item dupe
+The length specified will be used as a byte length (using
+L), not a character length.
 
-TODO: implement
+=head2 nparens
 
-=item numbered_buff_get
+=head2 gofs
 
-TODO: implement
+=head2 callbacks
 
-=item named_buff_get
+    # A dumb regexp engine that just tests string equality
+    use re::engine::Plugin comp => sub {
+        my ($re) = @_;
 
-TODO: implement
+        my $pat = $re->pattern;
 
-=back
+        $re->callbacks(
+            exec => sub {
+                my ($re, $str) = @_;
+                return $pat eq $str;
+            },
+        );
+    };
 
-=head2 flags
+Takes a list of key-value pairs of names and subroutines, and replaces the
+callback currently attached to the regular expression for the type given as
+the key by the code reference passed as the corresponding value.
 
-L
+The only valid keys are currently C and C. See L and
+L for more details about these callbacks.
 
-=head1 TODO
+=head2 num_captures
 
-=over
+    $re->num_captures(
+        FETCH => sub {
+            my ($re, $paren) = @_;
 
-=item *
+            return "value";
+        },
+        STORE => sub {
+            my ($re, $paren, $rhs) = @_;
 
-Provide an API for named (C<$+{name}>) and unnamed (C<$1, $2, ...>)
-match variables, allow specifying both offsets into the pattern and
-any given scalar.
+            # return value discarded
+        },
+        LENGTH => sub {
+            my ($re, $paren) = @_;
 
-=item *
+            return 123;
+        },
+    );
 
-Find some neat example for the L, suggestions welcome.
+Takes a list of key-value pairs of names and subroutines that
+implement numbered capture variables. C will be called on value
+retrieval (C), C on assignment (C<$1 = "ook">) and
+C on C.
 
-=back
+The second parameter of each routine is the paren number being
+requested/stored, the following mapping applies for those numbers:
 
-=head1 BUGS
+    -2 => $` or ${^PREMATCH}
+    -1 => $' or ${^POSTMATCH}
+     0 => $& or ${^MATCH}
+     1 => $1
+     # ...
 
-Please report any bugs that aren't already listed at
-L to
-L
+Assignment to capture variables makes it possible to implement
+something like Perl 6 C<:rw> semantics, and since it's possible to
+make the capture variables return any scalar instead of just a string
+it becomes possible to implement Perl 6 match object semantics (to
+name an example).
 
-=over 1
+=head2 named_captures
 
-=item
+B: implement
 
-Calling C or anything that uses it (such as C) in the
-L callback routines will not be trapped by an C block
-that the pattern is in, i.e.
+perl internals still need to be changed to support this but when it's
+done it'll allow the binding of C<%+> and C<%-> and support the
+L methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY,
+NEXTKEY and SCALAR.
 
-    use Carp qw(croak);
-    use re::engine::Plugin(
-        comp => sub {
-            my $re = shift;
-            croak "Your pattern is invalid"
-                unless $re->pattern =~ /pony/;
-        }
-    );
+=head1 CONSTANTS
 
-    # Ignores the eval block
-    eval { /you die in C, you die for real/ };
+=head2 C
 
-Simply put this happens because the real subroutine call happens
-indirectly and not in the scope of the C block.
+True iff the module could have been built with thread-safety features
+enabled.
 
-=back
+=head2 C
 
-=head1 Regexp object
+True iff this module could have been built with fork-safety features
+enabled. This will always be true except on Windows where it's false
+for perl 5.10.0 and below.
 
-The regexp object is passed around as the first argument to all the
-callback routines, it supports the following method calls (with more
-to come!).
+=head1 TAINTING
 
-=over
+The only way to untaint an existing variable in Perl is to use it as a
+hash key or to reference subpatterns from a regular expression match
+(see L), the
+latter only works in perl's regex engine because it explicitly
+untaints capture variables which a custom engine will also need to do
+if it wants its capture variables to be untainted.
 
-=item pattern
+There are basically two ways to go about this, the first and obvious
+one is to make use of Perl's lexical scoping which enables the use of
+its built-in regex engine in the scope of the overriding engine's
+callbacks:
 
-Returns the pattern this regexp was compiled with.
+    use re::engine::Plugin (
+        exec => sub {
+            my ($re, $str) = @_; # $str is tainted
 
-=item flags
+            $re->num_captures(
+                FETCH => sub {
+                    my ($re, $paren) = @_;
 
-Returns a string of flags the pattern was compiled
-with. (e.g. C<"xs">). The flags are not guarenteed to be in any
-particular order, so don't depend on the current one.
+                    # This is perl's engine doing the match
+                    $str =~ /(.*)/;
 
-=item stash
+                    # $1 has been untainted
+                    return $1;
+                },
+            );
+        },
+    );
 
-Returns or sets a user-defined stash that's passed around with the
-pattern, this is useful for passing around an arbitary scalar between
-callback routines, example:
+The second is to use something like L which flips the
+taint flag on the scalar without invoking perl's regex engine:
 
+    use Taint::Util;
     use re::engine::Plugin (
-        comp => sub { $_[0]->stash( [ 1 .. 5 ] ) },
-        comp => sub { $_[0]->stash }, # Get [ 1 .. 5]
+        exec => sub {
+            my ($re, $str) = @_; # $str is tainted
+
+            $re->num_captures(
+                FETCH => sub {
+                    my ($re, $paren) = @_;
+
+                    # Copy $str and untaint the copy
+                    untaint(my $ret = $str);
+
+                    # Return the untainted value
+                    return $ret;
+                },
+            );
+        },
+    );
+
+In either case a regex engine using perl's L or
+this module is responsible for how and if it untaints its variables.
+
+=head1 SEE ALSO
+
+L, L
+
+=head1 TODO & CAVEATS
+
+I
+
+=over
+
+=item *
+
+Engines implemented with this module don't support C and C, the appropriate parts of the C struct need to be wrapped
+and documented.
+
+=item *
+
+Still not a complete wrapper for L in other ways, needs
+methods for some C struct members, some callbacks aren't
+implemented etc.
+
+=item *
+
+Support overloading operations on the C object, this allows
+control over the of C objects in a manner that isn't limited by
+C/C.
+
+    $re->overload(
+        '""' => sub { ... },
+        '@{}' => sub { ... },
+        ...
     );
 
-=item minlen
+=item *
+
+Support the dispatch of arbitrary methods from the re::engine::Plugin
+qr// object to user defined subroutines via AUTOLOAD;
 
-The minimum length a given string must be to match the pattern, set
-this to an integer in B and perl will not call your B
-routine unless the string being matched as at least that long. Returns
-the currently set length if not called with any arguments or C
-if no length has been set.
+    package re::engine::Plugin;
+    sub AUTOLOAD
+    {
+        our $AUTOLOAD;
+        my ($name) = $AUTOLOAD =~ /.*::(.*)/;
+        my $cv = getmeth($name); # or something like that
+        goto &$cv;
+    }
+
+    package re::engine::SomeEngine;
+
+    sub comp
+    {
+        my $re = shift;
+
+        $re->add_method( # or something like that
+            foshizzle => sub {
+                my ($re, @arg) = @_; # re::engine::Plugin, 1..5
+            },
+        );
+    }
+
+    package main;
+    use re::engine::SomeEngine;
+    later:
+
+    my $re = qr//;
+    $re->foshizzle(1..5);
+
+=item *
+
+Implement the dupe callback, test this on a threaded perl (and learn
+how to use threads and how they break the current model).
+
+=item *
+
+Allow the user to specify ->offs either as an array or a packed
+string. Can pack() even pack I32? Only IV? int?
+
+=item *
+
+Add tests that check for different behavior when curpm is and is not
+set.
+
+=item *
+
+Add tests that check the refcount of the stash and other things I'm
+mucking with, run valgrind and make sure everything is destroyed when
+it should.
+
+=item *
+
+Run the debugger on the test suite and find cases when the intuit and
+checkstr callbacks are called. Write wrappers around them and add
+tests.
 
 =back
 
-=head1 SEE ALSO
+=head1 DEPENDENCIES
 
-L
+L 5.10.
 
-=head1 THANKS
+A C compiler.
+This module may happen to build with a C++ compiler as well, but don't rely on it, as no guarantee is made in this regard.
 
-Yves explaining why I made the regexp engine a sad panda.
+L (standard since perl 5.6.0).
 
-=head1 AUTHOR
+=head1 BUGS
 
-Ævar Arnfjörð Bjarmason
+Please report any bugs that aren't already listed at
+L to
+L
+
+=head1 AUTHORS
+
+Ævar Arnfjörð Bjarmason C<< >>
+
+Vincent Pit C<< >>
 
 =head1 LICENSE
 
+Copyright 2007,2008 Ævar Arnfjörð Bjarmason.
+
+Copyright 2009,2010,2011,2013,2014,2015 Vincent Pit.
+
 This program is free software; you can redistribute it and/or modify it
 under the same terms as Perl itself.
 
-Copyright 2007 Ævar Arnfjörð Bjarmason.
-
 =cut