Perl literacy course.

There's more than one way to do it

lecture #3

Shlomo Yona <yona@cs.technion.ac.il> http://yeda.cs.technion.ac.il/~yona/


Today

  1. Quantifiers in regular expressions
  2. Greedy and non-greedy matching
  3. Search and replace
  4. Examples with split()
  5. Scope and lifetime of variables
  6. A word about references
  7. Pass by reference
  8. Subroutine prototypes
  9. Name lookup algorithm (handouts)

Today's lecture will also cover a lot of small details.

We will run through many examples.

Instead of trying to remember all the small details, try to focus on the big picture, as you can always check out the details on the freely available perldocs.


Matching repetitions

The quantifier metacharacters "?", "*", "+", and "{}" allow us to determine the number of repeats of a portion of a regex we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:

  • "a?" = match 'a' 1 or 0 times
  • "a*" = match 'a' 0 or more times, i.e., any number of times
  • "a+" = match 'a' 1 or more times, i.e., at least once
  • "a{n,m}" = match at least "n" times, but not more than "m" times.
  • "a{n,}" = match at least "n" times
  • "a{n}" = match exactly "n" times
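
The quantifiers above can be tried out with a small sketch like the following. The helper name `matches` and the sample strings are made up for illustration; the helper anchors the pattern so that the whole string has to match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the WHOLE string matches the pattern, else 0.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ /^(?:$pattern)$/ ? 1 : 0;
}

my @results = (
    matches('color',  'colou?r'),   # 'u' appears 0 times
    matches('colour', 'colou?r'),   # 'u' appears 1 time
    matches('',       'a*'),        # 0 repetitions still match
    matches('',       'a+'),        # 'a+' needs at least one 'a'
    matches('aaa',    'a{2,4}'),    # 3 is between 2 and 4
    matches('aaaaa',  'a{2,4}'),    # 5 is more than 4
    matches('aaaaa',  'a{2,}'),     # at least 2, no upper limit
    matches('aaa',    'a{3}'),      # exactly 3
);

print "@results\n";    # 1 1 1 0 1 0 1 1
```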


    Matching repetitions (cont.)

    
    /[a-z]+\s+\d*/;	# match a lowercase word, at least some space, and
    		# any number of digits
    /(\w+)\s+\1/;	# match doubled words of arbitrary length
    

    $year =~ /^\d{2,4}$/;          # make sure year is at least 2 but not more
                                   # than 4 digits

    $year =~ /^\d{4}$|^\d{2}$/;    # better match; throw out 3 digit dates

    These quantifiers will try to match as much of the string as possible, while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

    The first quantifier ".*" grabs as much of the string as possible while still having the regex match. The second quantifier ".*" has no string left to it, so it matches 0 times.


    Greedy pattern matching

    By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.

    
    "12345" =~ /(\d+)(\d)/; 	# $1 = '1234'	
    				# $2 = '5'
    


    Non greedy pattern matching

    If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":

    
    "12345" =~ /(\d+?)(\d)/; 	# $1 = '1'
    				# $2 = '2'
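
The difference matters most when a delimiter occurs more than once in the string. The following sketch uses a made-up piece of markup to compare the two behaviours; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $markup = '<b>bold</b> and <i>italic</i>';

# Greedy: .* runs up to the LAST '>' that still lets the pattern match.
my ($greedy) = $markup =~ /<(.*)>/;

# Non-greedy: .*? stops at the FIRST '>' that lets the pattern match.
my ($lazy) = $markup =~ /<(.*?)>/;

print "greedy: $greedy\n";    # greedy: b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";      # lazy:   b
```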
    


    Non greedy pattern matching (cont.)

    
    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
    


    Search and replace

    
    % perl -i -pe 's/search/replace/g' file
    % sed -e 's/search/replace/g'
    % nawk '{gsub("search","replace");print}'
    


    Search and replace (cont.)

    Search and replace is performed using "s/regex/replacement/modifiers". The "replacement" is a Perl double quoted string that replaces in the string whatever is matched with the "regex". The operator "=~" is also used here to associate a string with "s///". If matching against "$_", the "$_ =~" can be dropped. If there is a match, "s///" returns the number of substitutions made, otherwise it returns false.

    
    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;	# $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;	# strip single quotes,
    			# $y contains "quoted words"
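
The return value of "s///" is easy to overlook. Here is a minimal sketch with a made-up sample string; it uses the "g" modifier so that the count can be larger than one.

```perl
use strict;
use warnings;

my $text = "one fish two fish red fish";

# With /g the return value is the number of substitutions made.
my $count = ($text =~ s/fish/bird/g);

# No match: the string is left alone and the result is false.
my $none = ($text =~ s/whale/shark/);

print "$count replacements: $text\n";    # 3 replacements: one bird two bird red bird
print "nothing replaced\n" unless $none; # nothing replaced
```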
    


    Search and replace (cont.)

    With the "s///" operator, the matched variables "$1", "$2", etc. are immediately available for use in the replacement expression. With the global modifier, "s///g" will search and replace all occurrences of the regex in the string:

    
    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"
    


    Search and replace (cont.)

    The evaluation modifier "s///e" wraps an "eval{...}" around the replacement string and the evaluated result is substituted for the matched substring.

    
    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"
    

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

    The last example shows that "s///" can use other delimiters, such as "s!!!" and "s{}{}", and even "s{}//". If single quotes are used "s'''", then the regex and replacement are treated as single quoted strings.
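
As a rough illustration of the alternative delimiters, the sketch below uses braces to avoid escaping slashes and single quotes to switch off interpolation. The sample strings are invented for the example.

```perl
use strict;
use warnings;

# Braces as delimiters: no need to escape the '/' in the pattern.
my $path = '/usr/local/bin';
(my $win = $path) =~ s{/}{\\}g;      # every '/' becomes one backslash

# Single quotes as delimiters: the replacement is taken literally,
# so $5 here is plain text, not a capture variable.
my $template = 'NAME owes AMOUNT';
$template =~ s'AMOUNT'$5';

print "$win\n";         # \usr\local\bin
print "$template\n";    # NAME owes $5
```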


    The split operator

    "split /regex/, string" splits "string" into a list of substrings and returns that list. The regex determines the character sequence that "string" is split with respect to. For example, to split a string into words, use

    
    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
    			  # $word[1] = 'and'
    			  # $word[2] = 'Hobbes'
    


    The split operator (cont.)

    To extract a comma-delimited list of numbers, use

    
    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
    			    # $const[1] = '2.718'
    			    # $const[2] = '3.142'
    


    The split operator (cont.)

    If the empty regex "//" is used, the string is split into individual characters. If the regex has groupings, then the list produced contains the matched substrings from the groupings as well:

    
    $x = "/usr/bin";
    @parts = split m!(/)!, $x;	# $parts[0] = ''
    				# $parts[1] = '/'
    				# $parts[2] = 'usr'
    				# $parts[3] = '/'
    				# $parts[4] = 'bin'
    

    Since the first character of $x matched the regex, "split" prepended an empty initial element to the list.
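
The three behaviours just described (empty regex, groupings, and the empty leading element) can be compared side by side in a short sketch. The sample strings are illustrative.

```perl
use strict;
use warnings;

my @chars  = split //, "perl";           # ('p', 'e', 'r', 'l')
my @fields = split /(,)/, "a,b";         # ('a', ',', 'b'): separator is kept
my @parts  = split m!/!, "/usr/bin";     # ('', 'usr', 'bin'): empty first element

print join('|', @chars),  "\n";    # p|e|r|l
print join('|', @fields), "\n";    # a|,|b
print join('|', @parts),  "\n";    # |usr|bin
```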


    Reading more about Regular expressions:


    Subroutines

    To declare subroutines:

    
    sub NAME;                     # A "forward" declaration.
    sub NAME(PROTO);              #  ditto, but with prototypes
    

    sub NAME BLOCK                # A declaration and a definition.
    sub NAME(PROTO) BLOCK         #  ditto, but with prototypes


    Subroutines (cont.)

    To define an anonymous subroutine at runtime:

    
    $subref = sub BLOCK;                 # no proto
    $subref = sub (PROTO) BLOCK;         # with proto
    


    Subroutines (cont.)

    To call subroutines:

    
    NAME(LIST);    # & is optional with parentheses.
    NAME LIST;     # Parentheses optional if predeclared/imported.
    &NAME(LIST);   # Circumvent prototypes.
    &NAME;         # Makes current @_ visible to called subroutine.
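
The last form is the least obvious one. In the following sketch (all subroutine names are made up), `&total` with no argument list lets `total` see the caller's `@_`, while `total()` passes an empty list.

```perl
use strict;
use warnings;

sub total {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum;
}

sub via_parens    { return total(@_) }   # explicit argument list
sub via_ampersand { return &total }      # no list: total sees our @_ directly
sub via_empty     { return total() }     # empty list: total gets nothing

my @r = (via_parens(1, 2, 3), via_ampersand(1, 2, 3), via_empty(1, 2, 3));
print "@r\n";    # 6 6 0
```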
    


    reserved names for subroutines

    Functions whose names are in all upper case are reserved to the Perl core, as are modules whose names are in all lower case.

    A function in all capitals is a loosely-held convention meaning it will be called indirectly by the run-time system itself, usually due to a triggered event.

    Functions that do special, pre-defined things include "BEGIN", "CHECK", "INIT", "END", "AUTOLOAD", and "DESTROY"--plus all functions mentioned in the perltie manpage.


    Private Variables via my()

    Synopsis:

    
    my $foo;            # declare $foo lexically local
    my (@wid, %get);    # declare list of variables local
    my $foo = "flurp";  # declare $foo lexical, and init it
    my @oof = @bar;     # declare @oof lexical, and init it
    


    Private Variables via my() (cont.)

    The "my" operator declares the listed variables to be lexically confined to the enclosing block, conditional ("if/unless/elsif/else"), loop ("for/foreach/while/until/continue"), subroutine, "eval", or "do/require/use"'d file.

    If more than one value is listed, the list must be placed in parentheses.

    All listed elements must be legal lvalues.

    Only alphanumeric identifiers may be lexically scoped--magical built-ins like "$/" must currently be localized with "local" instead.


    Private Variables via my() (cont.)

    Unlike dynamic variables created by the "local" operator, lexical variables declared with "my" are totally hidden from the outside world, including any called subroutines. This is true if it's the same subroutine called from itself or elsewhere--every call gets its own copy.
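
The claim that every call gets its own copy can be checked with a recursive subroutine. This is only a sketch; the name `depth_report` is invented for the example.

```perl
use strict;
use warnings;

sub depth_report {
    my ($n) = @_;                          # private to this particular call
    my @inner = $n > 1 ? depth_report($n - 1) : ();
    return (@inner, "level $n");           # our $n is untouched by the inner calls
}

my @report = depth_report(3);
print join(", ", @report), "\n";    # level 1, level 2, level 3
```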


    Private Variables via my() (cont.)

    This doesn't mean that a "my" variable declared in a statically enclosing lexical scope would be invisible. Only dynamic scopes are cut off.

    For example, the "bumpx()" function below has access to the lexical $x variable because both the "my" and the "sub" occurred at the same scope, presumably file scope.

    
    my $x = 10;
    sub bumpx { $x++ }
    


    more about my()

    
    my $foo, $bar = 1;                  # WRONG: declares only $foo
    

    That has the same effect as

    
    my $foo;
    $bar = 1;
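
To declare several variables in one statement, put the list in parentheses. A minimal sketch with made-up names and values:

```perl
use strict;
use warnings;

my ($foo, $bar) = ('flurp', 1);   # both are new lexicals, both initialized
my ($baz, $qux);                  # both are new lexicals, both undefined

print "foo=$foo bar=$bar\n";                        # foo=flurp bar=1
print "baz is undefined\n" unless defined $baz;     # baz is undefined
```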
    


    more about my() (cont.)

    The declared variable is not introduced (is not visible) until after the current statement. Thus,

    
    my $x = $x;
    

    can be used to initialize a new $x with the value of the old $x, and the expression

    
    my $x = 123 and $x == 123
    

    is false unless the old $x happened to have the value "123".
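
A short sketch of the first idiom, using an inner block so that the old and the new $x are both lexicals. The array `@seen` exists only to record the values.

```perl
use strict;
use warnings;

my $x = 10;
my @seen;
{
    my $x = $x + 1;     # the $x on the right is still the outer one
    push @seen, $x;     # 11: the new, inner $x
}
push @seen, $x;         # 10: the outer $x was never modified

print "@seen\n";    # 11 10
```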


    use strict;

    
    use strict 'vars';
    

    Forces you to declare every variable before using it, with 'my', 'our', or 'use vars'.
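
The typical payoff is that misspelled variable names are rejected at compile time. The sketch below compiles the faulty code from a string with eval, purely so that the error can be captured and printed instead of stopping the script; the variable names are invented.

```perl
use strict;
use warnings;

# The string is compiled under the same 'use strict' as this file.
my $ok = eval q{
    my $total = 5;
    $totl = $total + 1;    # typo: $totl was never declared
    1;
};
my $error = $@;

print $ok ? "compiled\n" : "rejected: $error";
```

Without 'use strict', the same code would quietly create a new global variable named $totl.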


    use strict; (cont.)

    
    use strict; 
    

    With no import list, enables all three restrictions: 'vars', 'refs', and 'subs' (see 'perldoc strict' for more information).


    Persistent Private Variables

    Just because a lexical variable is lexically (also called statically) scoped to its enclosing block, this doesn't mean that within a function it works like a C static.

    It normally works more like a C auto, but with implicit garbage collection.


    Persistent Private Variables (AKA static variables)

    Unlike local variables in C or C++, Perl's lexical variables don't necessarily get recycled just because their scope has exited.

    If something more permanent is still aware of the lexical, it will stick around.

    So long as something else references a lexical, that lexical won't be freed--which is as it should be.

    You wouldn't want memory being freed until you were done using it, or kept around once you were done.

    Automatic garbage collection takes care of this for you.


    Persistent Private Variables (AKA static variables) (cont.)

    This means that you can pass back or save away references to lexical variables, whereas to return a pointer to a C auto is a grave error.

    It also gives us a way to simulate C's function statics.

    Here's a mechanism for giving a function private variables with both lexical scoping and a static lifetime.

    If you do want to create something like C's static variables, just enclose the whole function in an extra block, and put the static variable outside the function but in the block.


    Simulating a C static variable

    
    {
    	my $secret_val = 0;
    	sub gimme_another {
    		return ++$secret_val;
    	}
    }
    # $secret_val now becomes unreachable by the outside
    # world, but retains its value between calls to gimme_another
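
The same mechanism also works with anonymous subroutines, which makes it possible to have several independent counters. This is a sketch under that assumption; the name `make_counter` is made up, and the `->()` call syntax is one of the dereferencing forms not yet covered at this point.

```perl
use strict;
use warnings;

sub make_counter {
    my $count = 0;                      # a fresh lexical for every call
    return sub { return ++$count };     # the anonymous sub keeps $count alive
}

my $c1 = make_counter();
my $c2 = make_counter();

# Each counter has its own private $count.
my @got = ($c1->(), $c1->(), $c2->(), $c1->());
print "@got\n";    # 1 2 1 3
```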
    


    Dynamic scoping

    Applying 'local' to a global variable gives it a temporary value with dynamic scope.

    This is usually not what you want: stick to 'my' unless you really know what you're doing.


    Dynamic scoping (cont.)

    
    $a = 3.1416;
    {
    	local $a = 2.7183;
    	print "$a\n";	# 2.7183
    }
    print "$a\n";	# 3.1416
    

    Although this looks like it does the same thing 'my' would in terms of output, behind the scenes something completely different happens.


    Dynamic scoping (cont.)

    In the case of 'my' Perl creates a separate variable that cannot be accessed by name at run time. In other words, it never appears in a package symbol table. During the execution of the inner block, the global $a on the outside continues to exist, with its value of 3.1416, in the symbol table.

    In the case of 'local', Perl saves the current contents of $a on a run-time stack. The contents of $a are then REPLACED by the new value. When the program exits the enclosing block, the values saved by 'local' are restored. There is only one variable named $a in existence throughout the entire example.
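
The difference becomes visible as soon as another subroutine is called from inside the block. In this sketch (all names invented), `show` always reads the package variable, so it sees a value set with 'local' but never one set with 'my'.

```perl
use strict;
use warnings;

our $level = 'global';                 # a package variable

sub show { return $level }             # always reads the package variable

sub with_local { local $level = 'local'; return show() }   # temporarily replaces the global
sub with_my    { my    $level = 'my';    return show() }   # new variable, invisible to show

my @r = (with_local(), with_my(), show());
print "@r\n";    # local global global
```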


    Dynamic scoping (cont.)

    See the 'Temporary Values via local()' entry in the perlsub manpage for details and also the 'When to Still Use local()' entry in the perlsub manpage.


    References

    A detailed discussion and overview of references will be given in the next lecture.

    For now, we will just see one way of referencing and de-referencing things in Perl.


    References (cont.)

    reference

    
    $scalarref = \$foo;
    $arrayref  = \@ARGV;
    $hashref   = \%ENV;
    $coderef   = \&handler;
    


    References (cont.)

    de-reference

    
    $bar = $$scalarref;
    push(@$arrayref, $filename);
    $$arrayref[0] = "January";
    $$hashref{"KEY"} = "VALUE";
    &$coderef(1,2,3);
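
Put together into one runnable sketch, the referencing and dereferencing forms look like this. The data and the body of `handler` are made up; private variables stand in for @ARGV and %ENV so that the example does not touch the real environment.

```perl
use strict;
use warnings;

my $foo   = 'hello';
my @names = ('Jan', 'February');
my %conf  = ();
sub handler { return join '+', @_ }

my $scalarref = \$foo;
my $arrayref  = \@names;
my $hashref   = \%conf;
my $coderef   = \&handler;

$$scalarref      = 'goodbye';      # changes $foo itself
$$arrayref[0]    = 'January';      # changes $names[0]
push(@$arrayref, 'March');         # appends to @names
$$hashref{"KEY"} = "VALUE";        # adds a key to %conf
my $result = &$coderef(1, 2, 3);   # calls handler(1, 2, 3)

print "$foo | @names | $conf{KEY} | $result\n";
# goodbye | January February March | VALUE | 1+2+3
```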
    


    References (cont.)

    There are many more ways of referencing things in Perl and also dereferencing them, but all we need for now is the terminology and some very basic understanding, so we can get on with this lecture's material - we will understand more in this evening's lecture.


    Pass by Reference

    If you want to pass more than one array or hash into a function--or return them from it--and have them maintain their integrity, then you're going to have to use an explicit pass-by-reference.

    Here are a few simple examples. First, let's pass in several arrays to a function and have it "pop" all of them, returning a new list of all their former last elements:

    
    @tailings = popmany ( \@a, \@b, \@c, \@d );
    

    sub popmany {
    	my $aref;
    	my @retlist = ();
    	foreach $aref ( @_ ) {
    		push @retlist, pop @$aref;
    	}
    	return @retlist;
    }
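
A self-contained version that can be run as is might look like the following. The sample arrays are invented, and the subroutine is restated under 'use strict' so that the block stands on its own.

```perl
use strict;
use warnings;

sub popmany {
    my @retlist;
    foreach my $aref (@_) {
        push @retlist, pop @$aref;    # pop modifies the caller's array
    }
    return @retlist;
}

my @x = (1, 2, 3);
my @y = ('p', 'q');
my @z = ('only');

my @tailings = popmany(\@x, \@y, \@z);
my @sizes    = (scalar @x, scalar @y, scalar @z);

print "@tailings\n";    # 3 q only
print "@sizes\n";       # 2 1 0
```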


    Pass by Reference (cont.)

    Here's how you might write a function that returns a list of keys occurring in all the hashes passed to it:

    
    @common = inter( \%foo, \%bar, \%joe );
    sub inter {
    	my ($k, $href, %seen); # locals
    	foreach $href (@_) {
    		while ( $k = each %$href ) {
    			$seen{$k}++;
    		}
    	}
    	return grep { $seen{$_} == @_ } keys %seen;
    }
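
Here is a runnable sketch of the same idea with made-up hashes. It iterates with `keys` instead of `each`, which is a stylistic substitution and not a change in behaviour. The result is sorted only to make the printed order predictable, since hash keys come back in no particular order.

```perl
use strict;
use warnings;

sub inter {
    my %seen;
    foreach my $href (@_) {
        $seen{$_}++ for keys %$href;    # count in how many hashes each key occurs
    }
    # @_ in numeric context is the number of hashes passed in.
    return grep { $seen{$_} == @_ } keys %seen;
}

my %foo = (a => 1, b => 2, c => 3);
my %bar = (b => 1, c => 1, d => 1);
my %joe = (c => 1, b => 1, z => 1);

my @common = inter(\%foo, \%bar, \%joe);
@common = sort @common;
print "@common\n";    # b c
```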
    


    Pass by Reference (cont.)

    So far, we're using just the normal list return mechanism.

    What happens if you want to pass or return a hash? Well, if you're using only one of them, or you don't mind them concatenating, then the normal calling convention is ok, although a little expensive.

    Where people get into trouble is here:

    
    (@a, @b) = func(@c, @d);	# MISTAKE!
    
    or
    
    (%a, %b) = func(%c, %d);	# MISTAKE!
    

    That syntax simply won't work. It sets just "@a" or "%a" and clears the "@b" or "%b". Plus, the function was not passed two separate arrays or hashes: it got one long list in "@_", as always.
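
One way around this is to pass references in and to return references. The sketch below assumes a hypothetical `func` that sorts its input numbers into evens and odds; only the calling convention matters here.

```perl
use strict;
use warnings;

# Hypothetical func: takes two array references, returns two array references.
sub func {
    my ($cref, $dref) = @_;
    my @evens = grep { $_ % 2 == 0 } @$cref, @$dref;
    my @odds  = grep { $_ % 2 }      @$cref, @$dref;
    return (\@evens, \@odds);
}

my @c = (1, 2, 3);
my @d = (4, 5, 6);

# The mistake: the first array swallows the whole list.
my (@greedy, @starved) = (@c, @d);
print scalar(@greedy), " ", scalar(@starved), "\n";    # 6 0

# With references, both results keep their identity.
my ($aref, $bref) = func(\@c, \@d);
print "@$aref | @$bref\n";    # 2 4 6 | 1 3 5
```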


    Prototypes

    Perl supports a very limited kind of compile-time argument checking using function prototyping.

    Declared as                 Called as
    sub mylink ($$)             mylink $old, $new
    sub myvec ($$$)             myvec $var, $offset, 1
    sub myindex ($$;$)          myindex &getstring, "substr"
    sub mysyswrite ($$$;$)      mysyswrite $buf, 0, length($buf) - $off, $off
    sub myreverse (@)           myreverse $a, $b, $c
    sub myjoin ($@)             myjoin ":", $a, $b, $c
    sub mypop (\@)              mypop @array
    sub mysplice (\@$$@)        mysplice @array, @array, 0, @pushme
    sub mykeys (\%)             mykeys %{$hashref}
    sub myopen (*;$)            myopen HANDLE, $name
    sub mypipe (**)             mypipe READHANDLE, WRITEHANDLE
    sub mygrep (&@)             mygrep { /foo/ } $a, $b, $c
    sub myrand ($)              myrand 42
    sub mytime ()               mytime
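
Two of the rows above, sketched with possible bodies. The bodies are assumptions made for illustration; the table only specifies the prototypes and the calls. Note that a prototype only takes effect for calls compiled after the declaration.

```perl
use strict;
use warnings;

# (\@): the caller writes a plain array, the sub receives a reference to it.
sub mypop (\@) {
    my ($aref) = @_;
    return pop @$aref;
}

# (&@): the first argument may be a bare block, as with the built-in grep.
sub mygrep (&@) {
    my ($code, @items) = @_;
    my @kept;
    for (@items) {                      # sets $_ for the block, like grep does
        push @kept, $_ if $code->();
    }
    return @kept;
}

my @stack = (1, 2, 3);
my $last  = mypop @stack;                              # no backslash needed
my @foo   = mygrep { /foo/ } 'foo1', 'bar', 'food';

print "$last | @stack | @foo\n";    # 3 | 1 2 | foo1 food
```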


    A note about returning lists from subroutines

    You probably figured out that subroutines can return either scalar or list values.

    You probably understand the significance of scalar and list context in Perl.

    You should consider using 'wantarray' in subroutines which return lists.


    A note about returning lists from subroutines (cont.)

    wantarray

    Returns true if the context of the currently executing subroutine is looking for a list value.

    Returns false if the context is looking for a scalar.

    Returns the undefined value if the context is looking for no value (void context).

    
    return unless defined wantarray;    # don't bother doing more
    my @a = complex_calculation();
    return wantarray ? @a : "@a";
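
Wrapped in a complete subroutine, the pattern could look like this. The subroutine `words` and the counter `$calls` are invented for the sketch; the counter only shows that the work is skipped in void context.

```perl
use strict;
use warnings;

my $calls = 0;    # counts how often the real work is done

sub words {
    return unless defined wantarray;    # void context: don't bother
    $calls++;
    my @w = split ' ', "there is more than one way";
    return wantarray ? @w : "@w";       # a list, or one joined string
}

my @list   = words();    # list context
my $string = words();    # scalar context
words();                 # void context

print scalar(@list), " words\n";    # 6 words
print "$string\n";                  # there is more than one way
print "$calls calls did work\n";    # 2 calls did work
```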
    

    This function should have been named wantlist() instead.


    Name lookup algorithm

    Please see your handouts and read this offline.

    We will not have time to present this in class properly in this lecture, but you might want to be sure you get the idea - so you can figure out how Perl decides if and when your variable is valid or not.


    Further reading


    Thank you
