Regular expressions are widely used for detecting patterns in text. In this post, I summarize the tricks and common regexes that I find useful.

Basic REGEX operations

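  • In Python’s re module, patterns are usually written as raw strings (r'...') so that Python itself doesn’t process the backslashes; sequences like \d and \b then reach the regex engine unchanged. To match a literal backslash, write r'\\'.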
  • re provides Perl-style regular expression patterns.
  • \d matches any numeric digit
  • \D matches any non-digit character
  • \w matches any word character: letters, digits and the underscore ([A-Za-z0-9_] in ASCII mode)
  • \W matches any non-word character
  • \s matches any whitespace character (space, \t, \n, \r, \f, \v)
  • \S matches any non-whitespace character
  • . is a special character that matches any character except the newline \n, so you need to escape it as \. when you want to match a literal dot
  • ^ and $ anchor a regex to the start and the end of the string, respectively (or of each line when re.MULTILINE is set)
  • A caret ^ at the start of a character class like [0-9] negates it. [^0-9] matches any character that isn’t in the range 0-9.
  • \b matches a word boundary: a zero-width position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string. It only asserts that the boundary is there; the boundary character itself isn’t part of the match. Example: re.search(r'\ba\b', 'ads a ') matches the single character 'a' without its surrounding spaces. On the other hand, re.search(r'\Wa\W', 'ads a ') also matches the spaces: ' a '.
  • When specifying a repetition range, don’t add a space after the comma: {m,n} is a quantifier, while {m, n} is treated as literal text and silently fails to repeat anything.
  • ( ) are used to form groups of characters/patterns. This is useful in two cases: 1) Treating a pattern as a single unit: (ha){2,} matches strings like haha, hahaha, and so on. 2) Capturing part of the match: (bob|alice) saves whichever name matched in a numbered group that can be read back later.
  • Groups are used to extract certain parts of the matched string. They are delimited by the ( and the ) characters. For example, re.search(r'(\w+):(\w+)', 'Name:Amr') extracts the field and the value from the line Name:Amr:
    • re.search(r'(\w+):(\w+)', 'Name:Amr').group(1) will be equal to Name
    • re.search(r'(\w+):(\w+)', 'Name:Amr').group(2) will be equal to Amr
    • group(0) returns the whole matched string (Name:Amr in this case)
  • There are some flags that are worth using. For example, re.IGNORECASE matches both lowercase and uppercase characters: re.match(r'amr', INPUT_STR, re.IGNORECASE) succeeds when INPUT_STR starts with any of amr, Amr, aMr, amR, AMr, aMR, AmR, AMR.
  • To use multiple flags, combine them with the bitwise OR operator |, e.g. re.IGNORECASE | re.VERBOSE
  • The re.VERBOSE flag allows whitespace and comments inside a regex. According to Python 3’s regex guide: “whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly.” Example:
    charref = re.compile(r"""
     &[#]                # Start of a numeric entity reference
     (
         0[0-7]+         # Octal form
       | [0-9]+          # Decimal form
       | x[0-9a-fA-F]+   # Hexadecimal form
     )
     ;                   # Trailing semicolon
    """, re.VERBOSE)
    

    can be used instead of:

    charref = re.compile("&#(0[0-7]+"
                       "|[0-9]+"
                       "|x[0-9a-fA-F]+);")
    
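  • Other strings can be inserted into a regex with string formatting: r'REGEX_START{}REGEX_END'.format(EXTERNAL_STRING). Wrap the external string in re.escape() if it should be matched literally, and double the braces of any quantifier ({{m,n}}) so that format() leaves them alone.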
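
Several of the points above can be sketched in a few lines, using only the standard re module. The sample strings and the helper name build_version_regex are made up for illustration; they are assumptions, not part of the original notes.

```python
import re

# \b asserts a word boundary without consuming it, while \W consumes
# the neighbouring non-word characters.
boundary_match = re.search(r'\ba\b', 'ads a ').group(0)
nonword_match = re.search(r'\Wa\W', 'ads a ').group(0)

# {m,n} is a quantifier; with a space after the comma the braces are
# treated as literal text, so the pattern no longer matches 'aa'.
quantifier = re.fullmatch(r'a{1,2}', 'aa')
literal_braces = re.fullmatch(r'a{1, 2}', 'aa')

# Capturing groups: group(0) is the whole match, group(1), group(2), ...
# are the parenthesised parts.
pair = re.search(r'(\w+):(\w+)', 'Name:Amr')

# Flags are combined with bitwise OR.
name = re.compile(r"""
    amr     # the name, matched in any letter case
    """, re.IGNORECASE | re.VERBOSE)


def build_version_regex(prefix):
    """Hypothetical helper: embed an external string in a regex.

    re.escape() makes the prefix match literally, so the dot in 'v1.'
    does not act as a wildcard.
    """
    return re.compile(r'^{}\d+$'.format(re.escape(prefix)))


version = build_version_regex('v1.')

print(repr(boundary_match), repr(nonword_match))
print(pair.group(0), pair.group(1), pair.group(2))
print(bool(name.match('AmR')), bool(version.match('v1.23')),
      bool(version.match('v1x23')))
```

In this sketch version.match('v1x23') returns None, because the escaped dot only matches a literal dot. Without re.escape(), the same input would match.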

Extensions to the basic operations

  • There are some extensions that were added by Perl and ported to Python. These extensions are introduced by a ? right after an opening (.
  • (?: ) is called a non-capturing group. It groups a pattern without saving what it matched in a variable (group) to be used later.
  • (?P<GROUP_NAME>GROUP_REGEX) gives a group a name, so it can be retrieved by key, e.g. match.group('GROUP_NAME'), instead of depending on its index.
  • Lookaheads are assertions that check that a certain pattern is found (positive lookahead) (?=...) or not found (negative lookahead) (?!...) without actually consuming the characters. For example, imagine that we want to match file names with any extension except .bat. The regex .*[.](?!bat$)[^.]*$ is composed of:
    • .* to match the file name
    • [.] to match the dot
    • A negative lookahead, (?!bat$), which asserts that the part following the dot isn’t exactly bat. Because of the $ inside the lookahead, extensions that merely start with bat, such as .batch, are still allowed
    • If the negative lookahead is satisfied, the extension is matched by [^.]*$, which matches all characters other than a dot until the end of the string
  • Similarly, there are two lookbehind assertions: positive lookbehind (?<=...) and negative lookbehind (?<!...). In Python’s re, the pattern inside a lookbehind must match strings of a fixed length.
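
The extensions above can be tried out with a short sketch like the following. The names NOT_BAT and PRICE_TEXT and the sample inputs are illustrative assumptions.

```python
import re

# Named groups: read the captures by key instead of by index.
named = re.search(r'(?P<field>\w+):(?P<value>\w+)', 'Name:Amr')
field = named.group('field')
as_dict = named.groupdict()

# Non-capturing group: (?:ha) is repeated as a unit, but nothing is saved.
laugh = re.search(r'(?:ha){2,}', 'hahaha')

# Negative lookahead: any extension except exactly "bat".
NOT_BAT = re.compile(r'.*[.](?!bat$)[^.]*$')
accepted = [f for f in ('notes.txt', 'autoexec.bat', 'run.batch', 'README')
            if NOT_BAT.match(f)]

# Lookbehinds: numbers that are (or are not) preceded by a dollar sign.
PRICE_TEXT = 'cost $30, qty 4'
prices = re.findall(r'(?<=\$)\d+', PRICE_TEXT)
non_prices = re.findall(r'(?<!\$)\b\d+', PRICE_TEXT)

print(field, as_dict)
print(laugh.group(0), laugh.groups())
print(accepted)         # ['notes.txt', 'run.batch']
print(prices, non_prices)
```

Note that README is rejected as well: the regex requires a dot, so it only accepts names that have an extension at all.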

Things that need more investigation

  • Greedy vs non-greedy matching
  • Locale-aware matching

Resources and references