A regular expression (regex) is a string that describes a pattern. It is heavily used in Perl to check whether a pattern is present in a string. In practice, we can use regular expressions to validate phone numbers (must be 10 digits), email addresses (must be in the form abc@example.com), etc.
We can check whether a regular expression matches a string by using $string =~ /regular_expression/ in Perl.
The simplest regular expression consists of alphanumeric characters only, i.e., /alpha_num/, where 'alpha_num' contains only alphanumeric characters. Alphanumeric characters in a regular expression stand for themselves, so /abc/ matches the string "abc". Let's take an example to see this:
$a = "Hey there. I am Perl";
if ($a =~ /perl/){
    print "YES, perl present\n";
}
elsif ($a =~ /Perl/){
    print "Perl present\n";
}
else{
    print "NO\n";
}
You can see that $a =~ /Perl/ checked whether "Perl" is present in $a, so the program prints "Perl present". Also, "perl" didn't match in $a because matching is case-sensitive.
Using i - case-insensitive
We can use i to make the match case-insensitive. Let's see an example:
$a = "Hey there. I am Perl";
if ($a =~ /perl/i){
    print "YES, perl present\n";
}
elsif ($a =~ /Perl/){
    print "Perl present\n";
}
else{
    print "NO\n";
}
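As a small additional sketch, the same modifier can be wrapped in a helper and tried against several spellings. The sub name mentions_perl and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text mentions "perl" in any letter case.
sub mentions_perl {
    my ($text) = @_;
    return $text =~ /perl/i ? 1 : 0;
}

for my $text ("I am Perl", "I am PERL", "I am Python") {
    print "$text: ", (mentions_perl($text) ? "YES" : "NO"), "\n";
}
```

The first two strings should report YES and the last one NO, since only the letter case differs between "Perl" and "PERL".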
Using m
The default // delimiters for a match can be changed to almost any other delimiter by putting m in front. The pattern is then written after m, between the new delimiters, instead of inside //. This is useful when the pattern itself contains '/'. Let's see an example:
if ("/he" =~ m"/h"){
    print "YES\n";
}
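Bracket pairs such as {} are a common choice of delimiter when the pattern contains slashes. The following is a minimal sketch, assuming a made-up helper name and made-up paths:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $path contains the directory /usr/local/.
# With m{...} the slashes inside the pattern need no backslashes.
sub contains_usr_local {
    my ($path) = @_;
    return $path =~ m{/usr/local/} ? 1 : 0;
}

for my $path ("/usr/local/bin/perl", "/usr/bin/perl") {
    print "$path: ", (contains_usr_local($path) ? "YES" : "NO"), "\n";
}
```

Written with the default delimiters, the same pattern would be /\/usr\/local\//, which is harder to read.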
Metacharacters
Some characters have a special meaning in a pattern, so we can't use them directly to stand for themselves. These characters are called metacharacters: {}[]()^$.|*+?\. To match one of them literally, we put a backslash (\) before it. Let's see an example:
if ("10*5+8" =~ /10\*5/){
    print "YES\n";
}
Notice that we have used '\' before '*' because '*' is a metacharacter.
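When the text to search for is held in a variable, escaping each metacharacter by hand is not practical. One possible approach is Perl's \Q...\E escape, which backslashes the metacharacters in whatever is interpolated between them. The helper names below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains $literal exactly as written.
# \Q...\E escapes every metacharacter in the interpolated variable.
sub contains_literal {
    my ($text, $literal) = @_;
    return $text =~ /\Q$literal\E/ ? 1 : 0;
}

# For contrast: the same check without escaping, so * keeps its special meaning.
sub contains_pattern {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

print "literal *5+ in 10*5+8: ", (contains_literal("10*5+8", "*5+") ? "YES" : "NO"), "\n";
print "literal 0*5 in 10558:  ", (contains_literal("10558", "0*5") ? "YES" : "NO"), "\n";
print "pattern 0*5 in 10558:  ", (contains_pattern("10558", "0*5") ? "YES" : "NO"), "\n";
```

The second and third lines differ because the unescaped pattern 0*5 means "any number of zeros followed by 5", which does occur in "10558", while the literal text "0*5" does not.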
Using ^ and $
We can use ^ to match at the beginning of a string. This means that the provided pattern must be present at the beginning of the string.
We can use $ to match at the end of a string, or just before a newline at the end of the string. This means that the provided pattern must be present at the end of the string.
For example, '^abc' will match in 'abcdef' but '^def' won't. Also, 'def$' will match in 'abcdef' but 'abc$' won't.
'^abcdef$' will match in 'abcdef'. Let's see a working example:
if ("abcdef" =~ /^abc/){
    print "abc in front\n";
}
if ("abcdef" =~ /^def/){
    print "def in front\n";
}
if ("abcdef" =~ /abc$/){
    print "abc at last\n";
}
if ("abcdef" =~ /def$/){
    print "def at last\n";
}
abc in front
def at last
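The "before a newline" part of the $ rule is easy to miss, so here is a minimal sketch of it. The helper name ends_with_def is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line ends with "def".
sub ends_with_def {
    my ($line) = @_;
    return $line =~ /def$/ ? 1 : 0;
}

# $ matches at the very end of the string
print "abcdef:          ", (ends_with_def("abcdef")   ? "match" : "no match"), "\n";
# $ also matches just before a newline that ends the string
print "abcdef + newline: ", (ends_with_def("abcdef\n") ? "match" : "no match"), "\n";
# here "def" is followed by another character, so $ cannot match
print "abcdefg:         ", (ends_with_def("abcdefg")  ? "match" : "no match"), "\n";
```

This behaviour is convenient for lines read from a file, which usually still carry their trailing newline.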
Matching anything
A dot (.) in a regular expression matches any single character except a newline. In the string "Hello", for instance, a dot can match any of H, e, l or o. We can use /H...o/ to check whether the string contains H and o separated by exactly three characters, as in a five-letter word that starts with H and ends with o. Similarly, /H.l../ checks for a five-character sequence whose first letter is H and whose third letter is l. Let's see an example:
if ("Hello" =~ /H...o/){
    print "YES\n";
}
if ("Hello" =~ /H.t../){
    print "Yes word with t as third and H as first letter present.\n";
}
if ("Hello" =~ /H.l../){
    print "YES\n";
}
YES
YES
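The newline exception can be checked with a short sketch like the one below. The helper name and test strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains H, any three characters, then o.
sub has_h_dots_o {
    my ($text) = @_;
    return $text =~ /H...o/ ? 1 : 0;
}

print "Hello:      ", (has_h_dots_o("Hello")      ? "match" : "no match"), "\n";
print "Oh, Hallo!: ", (has_h_dots_o("Oh, Hallo!") ? "match" : "no match"), "\n";
# One of the three characters between H and o is a newline, which a dot will not match
print "Hel\\no:     ", (has_h_dots_o("Hel\no")     ? "match" : "no match"), "\n";
```

The first two strings should match and the third should not, because its fourth character is a newline.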
Using []
[] is used to match any one character from the set inside the brackets. For example, "sas" =~ /[abc]/ will match 'a'; /[ab][10]/ will match a1, b1, a0 and b0; "abc" =~ /[cab]/ will match 'a' because the earliest point at which the regex can match is 'a'.
We can use [] with a hyphen (-) to match a range. For example, [a-z] can be used to check whether any of the characters from a to z is present in a string. We can use [a-zA-Z0-9] to match every alphanumeric character. Let's see an example:
if ("Hello" =~ /[a-z]/){
    print "lowercase present\n";
}
if ("hello" =~ /[A-Z]/){
    print "uppercase present\n";
}
else{
    print "uppercase absent\n";
}
# Finding a letter says nothing about digits, so test for a digit directly
if ("Hello123" =~ /[0-9]/){
    print "Yes numeric also\n";
}
else{
    print "No numeric\n";
}
lowercase present
uppercase absent
Yes numeric also
^ inside []
The meaning of ^ changes when it is the first character inside []. In that case, the class matches any character except the ones listed inside [].
Thus /[^a]/ will not match 'a'.
[^0-9] will match a non-numeric character.
[^c]ar will not match car. Let's see an example:
if ("car" =~ /[^c]ar/){
    print "It is not a car.\n";
}
else{
    print "It is a car\n";
}
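A negated class still has to match one character; it does not mean "nothing here". The sketch below, with made-up helper names, shows this along with the [^0-9] case mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains at least one non-digit character.
sub has_non_digit {
    my ($text) = @_;
    return $text =~ /[^0-9]/ ? 1 : 0;
}

# Hypothetical helper: true if $text contains "ar" preceded by a character other than c.
sub other_than_c_then_ar {
    my ($text) = @_;
    return $text =~ /[^c]ar/ ? 1 : 0;
}

print "12345 has a non-digit: ", (has_non_digit("12345") ? "YES" : "NO"), "\n";
print "123a5 has a non-digit: ", (has_non_digit("123a5") ? "YES" : "NO"), "\n";
print "bar: ", (other_than_c_then_ar("bar") ? "match" : "no match"), "\n";
print "car: ", (other_than_c_then_ar("car") ? "match" : "no match"), "\n";
# "ar" has no character before it for [^c] to consume, so it does not match either
print "ar:  ", (other_than_c_then_ar("ar")  ? "match" : "no match"), "\n";
```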
Perl uses the following abbreviations for common character classes:
Abbreviation | Represents | Equivalent to |
---|---|---|
\d | digit | [0-9] |
\s | whitespace | [\ \t\r\n\f] |
\w | word character (alphanumeric or _) | [0-9a-zA-Z_] |
\D | any character except a digit | [^0-9] |
\S | any non-whitespace character | [^\s] |
\W | any non-word character | [^\w] |
. | any character except a newline | [^\n] |
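A short sketch of how these abbreviations behave on a few sample strings follows. The helper name describe and the strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: lists which kinds of character occur in $text.
sub describe {
    my ($text) = @_;
    my @found;
    push @found, "digit"      if $text =~ /\d/;
    push @found, "whitespace" if $text =~ /\s/;
    push @found, "non-word"   if $text =~ /\W/;
    return join(", ", @found);
}

print "abc_123     -> ", describe("abc_123"), "\n";       # digit
print "hello world -> ", describe("hello world"), "\n";   # whitespace, non-word
print "a-b         -> ", describe("a-b"), "\n";           # non-word
```

Note that the underscore counts as a word character, so "abc_123" contains no \W match, while a space is both whitespace and a non-word character.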
Using | for alternation
We can use | for alternation. For example, bike|car will match either bike or car. Perl tries to match the regex at the earliest possible point in the string, and at each position it tries the alternatives from left to right. So at a given position Perl first checks for bike, and only if bike does not match there does it try the next alternative, car. Let's see an example:
if ("car" =~ /bike|car/){
    print "Either bike or car found.\n";
}
else{
    print "None found\n";
}
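To see which alternative actually matched, one option is to read Perl's built-in $& variable, which holds the text of the last successful match. The sketch below assumes a hypothetical helper matched_text and uses qr// only to pass a compiled regex as an argument.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text that $regex matched in $text, or undef.
sub matched_text {
    my ($text, $regex) = @_;
    return $text =~ $regex ? $& : undef;
}

# "car" occurs earlier in the string, so it wins even though bike is listed first
print "a car and a bike -> ", matched_text("a car and a bike", qr/bike|car/), "\n";   # car
# At the same position, alternatives are tried left to right
print "car|carpet       -> ", matched_text("carpet", qr/car|carpet/), "\n";           # car
print "carpet|car       -> ", matched_text("carpet", qr/carpet|car/), "\n";           # carpet
```

The last two lines suggest putting the longer alternative first when one alternative is a prefix of another.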
Checking repetitions
The metacharacters ?, * , + , and {} are used to determine the number of repetitions of a portion of a regex for a match. See the table given below to understand this:
Character | Matches | Example |
---|---|---|
? | 0 or 1 times | "abc" =~ /ab?c/ matches; "abbc" =~ /ab?c/ doesn't match |
* | 0 or more times | "abbc" =~ /ab*c/ matches; "ac" =~ /ab*c/ also matches |
+ | at least once | "abbc" =~ /ab+c/ matches; "ac" =~ /ab+c/ doesn't match |
{n,m} | at least n times but not more than m times | "abbc" =~ /ab{1,10}c/ matches; "abbc" =~ /ab{5,10}c/ doesn't match |
{n,} | at least n times | "abbc" =~ /ab{1,}c/ matches; "abbc" =~ /ab{10,}c/ doesn't match |
{n} | exactly n times | "abbc" =~ /ab{2}c/ matches; "abbc" =~ /ab{10}c/ doesn't match |
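Combining a repetition count with \d, ^ and $ gives a simple version of the 10-digit phone number check mentioned at the start. This is only a sketch under that one assumption (exactly ten digits, nothing else); the helper name is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text consists of exactly ten digits.
sub is_phone_number {
    my ($text) = @_;
    return $text =~ /^\d{10}$/ ? 1 : 0;
}

for my $candidate ("9876543210", "98765", "98765432101", "98765x3210") {
    print "$candidate: ", (is_phone_number($candidate) ? "valid" : "invalid"), "\n";
}
```

Only the first candidate should be reported as valid. Because $ also matches before a trailing newline, a ten-digit string followed by a newline would pass this check as well.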
Search and replace
We can search for a regex and replace what it matches using s///. The syntax is s/regex/replacement/modifiers. Here, the part of the string matched by 'regex' is replaced with 'replacement'. Modifiers are optional. Let's see an example:
$a = "take ride on car";
$a =~ s/car/bike/;
print "$a\n";
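Without modifiers, s/// replaces only the first match. The sketch below contrasts that with the g (global) modifier combined with i; the helper names and the sample sentence are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces only the first lowercase "car".
sub replace_first {
    my ($text) = @_;
    $text =~ s/car/bike/;
    return $text;
}

# Hypothetical helper: replaces every "car", whatever its letter case.
sub replace_all {
    my ($text) = @_;
    $text =~ s/car/bike/gi;
    return $text;
}

my $sentence = "Car, car and CAR";
print "first only: ", replace_first($sentence), "\n";   # Car, bike and CAR
print "all:        ", replace_all($sentence), "\n";     # bike, bike and bike
```

Both helpers work on a copy of the argument, so $sentence itself is left unchanged.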
grep
grep is a Perl function that filters a list, returning the elements that match a regex. Its syntax is grep(/regex/, LIST). Let's see an example that picks all the numbers out of an array:
@a = (1,'a',2,'b',3,'c',4,'d',5,'e');
@num = grep(/\d/,@a);
print "@num\n";
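Note that /\d/ keeps any element that contains a digit anywhere. If elements such as 'x9' can occur, anchoring the pattern is one way to keep only whole numbers. The following sketch uses made-up helper names and data.

```perl
use strict;
use warnings;

# Hypothetical helper: elements that contain at least one digit.
sub containing_digit {
    return grep { /\d/ } @_;
}

# Hypothetical helper: elements made up of digits only.
sub whole_numbers {
    return grep { /^\d+$/ } @_;
}

my @items = (1, 'a', 2, 'b', 'x9', 10);
my @loose = containing_digit(@items);
my @exact = whole_numbers(@items);
print "@loose\n";   # 1 2 x9 10
print "@exact\n";   # 1 2 10
```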