X-Git-Url: http://git.vpit.fr/?p=perl%2Fmodules%2Fre-engine-Plugin.git;a=blobdiff_plain;f=README;fp=README;h=9da0ecf52406a10dfc133c87339060e5d39465a5;hp=0000000000000000000000000000000000000000;hb=de28b8b2f575972f4c8ccb7f037db6f3b38860cf;hpb=5f3eda5a24d339e157734012513a9a8c047da3bb

diff --git a/README b/README
new file mode 100644
index 0000000..9da0ecf
--- /dev/null
+++ b/README
@@ -0,0 +1,360 @@
+NAME
+    re::engine::Plugin - API to write custom regex engines
+
+DESCRIPTION
+    As of perl 5.9.5 it's possible to lexically replace perl's built-in
+    regular expression engine with your own (see perlreapi and perlpragma).
+    This module provides a glue interface to the relevant parts of the perl
+    C API enabling you to write an engine in Perl instead of the C/XS
+    interface provided by the core.
+
+  The gory details
+    Each regex in perl is compiled into an internal "REGEXP" structure (see
+    perlreapi). This can happen either at compile time in the case of
+    patterns in the format "/pattern/", at runtime for "qr//" patterns, or
+    somewhere in between depending on variable interpolation etc.
+
+    When this module is loaded into a scope it inserts a hook into
+    $^H{regcomp} (as described in perlreapi and perlpragma) to have each
+    regexp constructed in its lexical scope handled by this engine. It
+    differs from other engines in that it also inserts other hooks into
+    "%^H" in the same scope that point to user-defined subroutines to use
+    during compilation, execution, etc. These are described in "CALLBACKS"
+    below.
+
+    The callbacks (e.g. "comp") then get called with a re::engine::Plugin
+    object as their first argument. This object provides access to perl's
+    internal REGEXP struct in addition to its own state (e.g. a stash). The
+    methods on this object allow for altering the "REGEXP" struct's internal
+    state, adding new callbacks, etc.
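To make the lexical scoping described above concrete, here is a minimal sketch. It assumes re::engine::Plugin is installed from CPAN; the always-true exec callback and the variable names $inside and $outside are illustrative only and not part of the module.

```perl
use strict;
use warnings;

my ($inside, $outside);

{
    # Within this block every regex is handed to our callbacks.
    use re::engine::Plugin (
        comp => sub { },      # nothing to prepare in this sketch
        exec => sub { 1 },    # illustrative: claim that everything matches
    );

    $inside = ("abc" =~ /xyz/) ? 1 : 0;
}

# The hints in %^H are lexical, so perl's built-in engine is expected
# to handle this match again.
$outside = ("abc" =~ /xyz/) ? 1 : 0;

print "inside=$inside outside=$outside\n";
```

Because the hooks live in "%^H", no global state needs to be restored when the block ends.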
+
+CALLBACKS
+    Callbacks are specified in the "re::engine::Plugin" import list as
+    key-value pairs of names and subroutine references:
+
+        use re::engine::Plugin (
+            comp => sub {},
+            exec => sub {},
+        );
+
+    To write a custom engine which imports your functions into the caller's
+    scope use the following snippet:
+
+        package re::engine::Example;
+        use re::engine::Plugin ();
+
+        sub import
+        {
+            # Sets the caller's $^H{regcomp} and fills its %^H with
+            # our callbacks
+            re::engine::Plugin->import(
+                comp => \&comp,
+                exec => \&exec,
+            );
+        }
+
+        *unimport = \&re::engine::Plugin::unimport;
+
+        # Implementation of the engine
+        sub comp { ... }
+        sub exec { ... }
+
+        1;
+
+  comp
+        comp => sub {
+            my ($rx) = @_;
+
+            # return value discarded
+        }
+
+    Called when a regex is compiled by perl. This is always the first
+    callback to be called and may be called multiple times or not at all
+    depending on what perl sees fit at the time.
+
+    The first argument will be a freshly constructed "re::engine::Plugin"
+    object (think of it as $self) which you can interact with using the
+    methods below. This object will be passed around the other callbacks
+    and methods for the lifetime of the regex.
+
+    Calling "die" or anything that uses it (such as "croak") here will not
+    be trapped by an "eval" block that the pattern is in, i.e.
+
+        use Carp 'croak';
+        use re::engine::Plugin(
+            comp => sub {
+                my $rx = shift;
+                croak "Your pattern is invalid"
+                    unless $rx->pattern =~ /pony/;
+            }
+        );
+
+        # Ignores the eval block
+        eval { /you die in C, you die for real/ };
+
+    This happens because the real subroutine call happens indirectly at
+    compile time and not in the scope of the "eval" block. This is how
+    perl's own engine would behave in the same situation if given an invalid
+    pattern such as "/(/".
+
+  exec
+        exec => sub {
+            my ($rx, $str) = @_;
+
+            # We always like ponies!
+            return 1 if $str =~ /pony/;
+
+            # Failed to match
+            return;
+        }
+
+    Called when a regex is being executed, i.e.
+    when it's being matched against something. The scalar being matched
+    against the pattern is available as the second argument ($str) and
+    through the str method. The routine should return a true value if the
+    match was successful, and a false one if it wasn't.
+
+METHODS
+  str
+        "str" =~ /pattern/;
+        # in comp/exec/methods:
+        my $str = $rx->str;
+
+    The last scalar to be matched against the pattern or "undef" if there
+    hasn't been a match yet.
+
+    perl's own engine always stringifies the scalar being matched against a
+    given pattern, however a custom engine need not have such restrictions.
+    One could write an engine that matched a file handle against a pattern
+    or any other complex data structure.
+
+  pattern
+    The pattern that the engine was asked to compile. This can be either a
+    classic Perl pattern with modifiers like "/pat/ix" or "qr/pat/ix" or an
+    arbitrary scalar. The latter allows for passing anything that doesn't
+    fit in a string and five modifier characters, such as hashrefs, objects,
+    etc.
+
+  mod
+        my %mod = $rx->mod;
+        say "has /ix" if $mod{i} and $mod{x};
+
+    A key-value pair list of the modifiers the pattern was compiled with.
+    The keys will be zero or more of "imsxp" and the values will be true
+    values (so that you don't have to write "exists").
+
+    You don't get to know if the "eogc" modifiers were attached to the
+    pattern since these are internal to perl and shouldn't matter to regexp
+    engines.
+
+  stash
+        comp => sub { shift->stash( [ 1 .. 5 ] ) },
+        exec => sub { shift->stash }, # Get [ 1 .. 5 ]
+
+    Returns or sets a user defined stash that's passed around as part of the
+    $rx object, useful for passing around all sorts of data between the
+    callback routines and methods.
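As an illustration of how the stash can carry data from comp to exec, here is a minimal sketch. It assumes re::engine::Plugin is installed from CPAN; the case-insensitive substring matching, the "needle" key and the sample strings are assumptions made for this example and are not something the module prescribes.

```perl
use strict;
use warnings;

use re::engine::Plugin (
    comp => sub {
        my ($rx) = @_;

        # Prepare the pattern once, at compile time, and keep the result.
        $rx->stash({ needle => lc $rx->pattern });
    },
    exec => sub {
        my ($rx, $str) = @_;

        # Reuse what comp stored: a case-insensitive substring test.
        my $needle = $rx->stash->{needle};
        return index(lc $str, $needle) >= 0;
    },
);

# These matches are handled by the callbacks above.
my @hits = grep { $_ =~ /PONY/ } ("My Little Pony", "horse");
print "matched: @hits\n";
```

Keeping the prepared data in the stash rather than in a closure variable ties it to the individual regex object, so several patterns compiled in the same scope do not overwrite each other's state.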
+
+  minlen
+        $rx->minlen($num);
+        my $minlen = $rx->minlen // "not set";
+
+    The minimum "length" a string must have to match the pattern. perl
+    uses this internally during matching to check whether the stringified
+    form of the string (or other object) being matched is at least this
+    long; if not, the regexp engine in effect (that means you!) will not
+    be called at all.
+
+    The length specified will be used as a byte length (using SvPV), not a
+    character length.
+
+  num_captures
+        $re->num_captures(
+            FETCH => sub {
+                my ($re, $paren) = @_;
+
+                return "value";
+            },
+            STORE => sub {
+                my ($re, $paren, $rhs) = @_;
+
+                # return value discarded
+            },
+            LENGTH => sub {
+                my ($re, $paren) = @_;
+
+                return 123;
+            },
+        );
+
+    Takes a list of key-value pairs of names and subroutines that implement
+    numbered capture variables. "FETCH" will be called on value retrieval
+    ("say $1"), "STORE" on assignment ("$1 = "ook"") and "LENGTH" on "length
+    $1".
+
+    The second parameter of each routine is the paren number being
+    requested/stored. The following mapping applies for those numbers:
+
+        -2 => $` or ${^PREMATCH}
+        -1 => $' or ${^POSTMATCH}
+         0 => $& or ${^MATCH}
+         1 => $1
+         # ...
+
+    Assignment to capture variables makes it possible to implement something
+    like Perl 6 ":rw" semantics, and since it's possible to make the capture
+    variables return any scalar instead of just a string it becomes possible
+    to implement Perl 6 match object semantics (to name an example).
+
+  named_captures
+    TODO: implement
+
+    perl internals still need to be changed to support this but when it's
+    done it'll allow the binding of "%+" and "%-" and support the Tie::Hash
+    methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, NEXTKEY and
+    SCALAR.
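The paren-number mapping given under num_captures can be sketched in plain Perl, with no dependency on this module. The helper special_capture below is hypothetical and not part of re::engine::Plugin. It assumes the engine has recorded the start and end offsets of the overall match somewhere (for instance in the stash), and it shows what a "FETCH" callback might return for the paren numbers -2, -1 and 0.

```perl
use strict;
use warnings;

# Hypothetical helper: map the special paren numbers onto parts of $str,
# given the start and end offsets of the overall match.
sub special_capture {
    my ($str, $start, $end, $paren) = @_;

    return substr($str, 0, $start)             if $paren == -2;  # $`
    return substr($str, $end)                  if $paren == -1;  # $'
    return substr($str, $start, $end - $start) if $paren ==  0;  # $&

    return;  # numbered groups ($1, ...) are left to the engine
}

my $str = "one pony two";
my ($start, $end) = (4, 8);  # offsets of "pony" within $str

my ($pre, $post, $match) =
    map { special_capture($str, $start, $end, $_) } -2, -1, 0;

printf "prematch=[%s] match=[%s] postmatch=[%s]\n", $pre, $match, $post;
```

A "FETCH" callback could delegate to such a helper for negative and zero paren numbers and handle positive ones itself.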
+
+Tainting
+    The only way to untaint an existing variable in Perl is to use it as a
+    hash key or to reference subpatterns from a regular expression match
+    (see perlsec). The latter only works in perl's regex engine because it
+    explicitly untaints capture variables, which a custom engine will also
+    need to do if it wants its capture variables to be untainted.
+
+    There are basically two ways to go about this. The first and obvious
+    one is to make use of Perl's lexical scoping which enables the use of
+    its built-in regex engine in the scope of the overriding engine's
+    callbacks:
+
+        use re::engine::Plugin (
+            exec => sub {
+                my ($re, $str) = @_; # $str is tainted
+
+                $re->num_captures(
+                    FETCH => sub {
+                        my ($re, $paren) = @_;
+
+                        # This is perl's engine doing the match
+                        $str =~ /(.*)/;
+
+                        # $1 has been untainted
+                        return $1;
+                    },
+                );
+            },
+        );
+
+    The second is to use something like Taint::Util which flips the taint
+    flag on the scalar without invoking perl's regex engine:
+
+        use Taint::Util;
+        use re::engine::Plugin (
+            exec => sub {
+                my ($re, $str) = @_; # $str is tainted
+
+                $re->num_captures(
+                    FETCH => sub {
+                        my ($re, $paren) = @_;
+
+                        # Copy $str and untaint the copy
+                        untaint(my $ret = $str);
+
+                        # Return the untainted value
+                        return $ret;
+                    },
+                );
+            },
+        );
+
+    In either case a regex engine using perl's regex api or this module is
+    responsible for how and if it untaints its variables.
+
+SEE ALSO
+    perlreapi, Taint::Util
+
+TODO / CAVEATS
+    *here be dragons*
+
+    *   Engines implemented with this module don't support "s///" and "split
+        //", the appropriate parts of the "REGEXP" struct need to be wrapped
+        and documented.
+
+    *   Still not a complete wrapper for perlreapi in other ways, needs
+        methods for some "REGEXP" struct members, some callbacks aren't
+        implemented etc.
+
+    *   Support overloading operations on the "qr//" object. This would
+        allow control over how "qr//" objects behave in a manner that isn't
+        limited by "wrapped"/"wraplen".
+
+            $re->overload(
+                '""'  => sub { ... },
+                '@{}' => sub { ... },
+                ...
+            );
+
+    *   Support the dispatch of arbitrary methods from the
+        re::engine::Plugin qr// object to user defined subroutines via
+        AUTOLOAD;
+
+            package re::engine::Plugin;
+            sub AUTOLOAD
+            {
+                our $AUTOLOAD;
+                # Greedy capture: a trailing (.*?) would always be empty
+                my ($name) = $AUTOLOAD =~ /.*::(.*)/;
+                my $cv = getmeth($name); # or something like that
+                goto &$cv;
+            }
+
+            package re::engine::SomeEngine;
+
+            sub comp
+            {
+                my $re = shift;
+
+                $re->add_method( # or something like that
+                    foshizzle => sub {
+                        my ($re, @arg) = @_; # re::engine::Plugin, 1..5
+                    },
+                );
+            }
+
+            package main;
+            use re::engine::SomeEngine;
+            # later:
+
+            my $re = qr//;
+            $re->foshizzle(1..5);
+
+    *   Implement the dupe callback, test this on a threaded perl (and learn
+        how to use threads and how they break the current model).
+
+    *   Allow the user to specify ->offs either as an array or a packed
+        string. Can pack() even pack I32? Only IV? int?
+
+    *   Add tests that check for different behavior when curpm is and is not
+        set.
+
+    *   Add tests that check the refcount of the stash and other things I'm
+        mucking with, run valgrind and make sure everything is destroyed
+        when it should.
+
+    *   Run the debugger on the testsuite and find cases when the intuit and
+        checkstr callbacks are called. Write wrappers around them and add
+        tests.
+
+BUGS
+    Please report any bugs that aren't already listed at
+    to
+
+AUTHORS
+    Ævar Arnfjörð Bjarmason ""
+
+    Vincent Pit ""
+
+LICENSE
+    Copyright 2007-2008 Ævar Arnfjörð Bjarmason.
+
+    Copyright 2009 Vincent Pit.
+
+    This program is free software; you can redistribute it and/or modify it
+    under the same terms as Perl itself.
+