My regex cheatsheet
Regular expressions are widely used in detecting text patterns. In this post, I will summarize my list of tricks and common regex that I find useful.
Basic REGEX operations
- In python’s
re
, raw python strings are used to allow the usage of\
as a literal that matches a backslash. re
provides Perl-style regular expression patterns.\d
matches any numeric digit\D
matches any non-numeric digit\w
matches any alphanumeric character [A-Z a-z 0-9 and _]\W
matches any non-alphanumeric character\s
matches any whitespace token (\t \n \r\n )\S
matches any non-whitespace token- Since
.
is a special token that matches any character except the endline\n
then you need to skip it\.
in case you want to match a dot ^
and$
are used to indicate that a regex should occur at the start and/or the end of the sentence- Using a carat
^
in a optional group like[0-9]
negates the range.[^0-9]
matches any character that isn’t in range 0-9. \b
matches any word boundary. A word boundary means: 1) Start of the string. 2) End of the string 3)\W
“Any non-alphanumeric character” This character just means that a regex is in a certain boundary but the boundary character isn’t matched. Example:re.search(r'\ba\b', 'ads a ')
will match the single character'a'
without its surrounding spaces. On the other hand,re.search(r'\Wa\W', 'ads a ')
will also match the spaces' a '
.- On specifying a range don’t add a space after the comma!!
{m,n}
works while{m, n}
doesn’t work. ( )
are used to form groups of characters/patterns. This is useful in two cases: 1) Matching a certain string as a whole(ha){2,}
matches strings like haha, hahaha, …. 2) Capturing certain groups from the regex matching(bob|alice)
will actually capture the name in some sort of a variable (group) that can be used later.- Groups are used to parse certain patterns within the string. They are indicated by the
(
and the)
characters. For example:re.search(r'(\w+):(\w+)', 'Name:Amr')
extracts the field and the value from the lineName:Amr
re.search(r'(\w+):(\w+)', 'Name:Amr').group(1)
will be equal toName
re.search(r'(\w+):(\w+)', 'Name:Amr').group(2)
will be equal toAmr
group(0)
returns the whole matched string (Name:Amr
in this case) - There are some useful flags that are worth using. For example,
re.IGNORECASE
can be used to match both lowercase and uppercase charactersre.match(r'amr', INPUT_STR, re.IGNORECASE)
will match (amr, Amr, aMr, amR, AMr, aMR, AmR, AMR). - To use multiple flags, these flags are combined using bitwise ORing
|
re.VERBOSE
flag allows the usage of spaces and comments in a regex. According to python3’s regex guide: “whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly.” Example:charref = re.compile(r""" &[#] # Start of a numeric entity reference ( 0[0-7]+ # Octal form | [0-9]+ # Decimal form | x[0-9a-fA-F]+ # Hexadecimal form ) ; # Trailing semicolon """, re.VERBOSE)
can be used instead of:
charref = re.compile("&#(0[0-7]+" "|[0-9]+" "|x[0-9a-fA-F]+);")
- Other strings can be used inside the regex,
r'REGEX_START{}REGEX_END'.format(EXTERNAL_STRING)
Extensions to the basic operations
- There are some extensions that were added by PERL and ported to python. These Extensions are activated using
?
after a(
. (?: )
is called a non-capturing group and it’s used to match certain patterns but without saving them in variables (groups) to be used later.(?P<GROUP_NAME>GROUP_REGEX)
is used to give a key to a certain group instead of depending on its index.- Lookaheads are assertions that check that a certain pattern is found (position lookahead)
(?=...)
or not found (negative lookahead)(?!...)
without actually consuming the characters. For example, Imagine that we want to match the files with all extensions excluding.bat
. The regex.*[.](?!bat$)[^.]*$
will be composed of:.*
to match the file name[.]
to match the dot- Now, we will need to use a negative lookahead. Ensure that the next characters aren’t
bat
. The pattern for this is(?!bat$)
which asserts that the part following the dot isn’tbat$
- If the negative lookahead is satisfied, we will match the extension as follows:
[^.]*$
match all characters other than a dot until the end of the string
- Similarly, there are two lookbehind assertions (positive lookbehind)
(?<=...)
and (negative lookbehind)(?<!...)
Things that needs more investigation
- Greedy vs non-greedy matching
- Using Locale aware matching
Resources and references
- Hackerrank challenges: https://www.hackerrank.com/domains/regex