Chapter 6: Python Standard Library

Sub-Chapter: Regular Expressions — Pattern Matching and Text Manipulation

Regular expressions (regex) are one of the most powerful tools for text analysis, validation, and transformation.
They allow you to search, match, and extract complex patterns from text with concise syntax.
Python’s re module, part of the standard library, implements them with Perl-style syntax for both str and bytes patterns.


🧱 1. Importing the re Module

import re

This module provides functions such as search(), match(), findall(), split(), and sub() for working with regex patterns.


🔍 2. Basic Pattern Matching

Example

pattern = r"\d+"  # Matches one or more digits
match = re.search(pattern, "The price is $100")
print(match.group())  # Output: 100

Common Metacharacters

Symbol | Description                  | Example | Matches
-------+------------------------------+---------+-------------------
.      | Any character except newline | a.b     | a1b, acb
^      | Start of string              | ^Hello  | Hello world
$      | End of string                | world$  | Hello world
*      | 0 or more repetitions        | bo*     | b, bo, boo
+      | 1 or more repetitions        | go+     | go, goo
?      | 0 or 1 repetition            | colou?r | color, colour
[]     | Character set                | [A-Z]   | Any capital letter
{n,m}  | Range of repetitions         | \d{2,4} | 99, 2024
\      | Escape special char          | \.      | A literal dot
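
The metacharacters in the table can be checked directly in an interpreter. The following is a small sketch; the sample strings are made up for illustration, and re.fullmatch() is used so that each pattern has to cover the whole string.

```python
import re

# Each entry: (pattern, text that should match, text that should not).
# The sample strings are illustrative only.
cases = [
    (r"a.b", "a1b", "ab"),             # . needs exactly one character
    (r"bo*", "b", "o"),                # * allows zero o's, but b is required
    (r"go+", "goo", "g"),              # + needs at least one o
    (r"colou?r", "color", "colouur"),  # ? allows at most one u
    (r"[A-Z]", "Q", "q"),              # set of uppercase ASCII letters
    (r"\d{2,4}", "2024", "7"),         # between 2 and 4 digits
    (r"\.", ".", "x"),                 # escaped dot matches only a literal dot
]

results = [
    (bool(re.fullmatch(p, good)), bool(re.fullmatch(p, bad)))
    for p, good, bad in cases
]

# ^ and $ are anchors, so they are easier to see with search().
starts = bool(re.search(r"^Hello", "Hello world"))
ends = bool(re.search(r"world$", "Hello world"))
not_at_start = bool(re.search(r"^world", "Hello world"))

print(results, starts, ends, not_at_start)
```

Every pair in results should come out as (True, False), and only the last anchor check should be False.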

🧩 3. Searching, Matching, and Finding

search() — Finds first match anywhere in the string

re.search(r"\d+", "Order ID: 12345").group()
# Output: '12345'

match() — Matches only at the beginning

re.match(r"\d+", "123abc").group()
# Output: '123'
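
Both search() and match() return None when nothing matches, so calling .group() directly on the result can raise AttributeError. A minimal sketch of the difference, using a made-up input string:

```python
import re

text = "abc123"

# match() anchors at position 0, so digits later in the string are not found.
at_start = re.match(r"\d+", text)
# search() scans the whole string.
anywhere = re.search(r"\d+", text)
# fullmatch() succeeds only if the pattern covers the entire string.
whole = re.fullmatch(r"[a-z]+\d+", text)

# Guard against None before calling .group().
found = anywhere.group() if anywhere else None
print(at_start, found, whole is not None)  # None 123 True
```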

findall() — Returns all matches

re.findall(r"\d+", "A1 B22 C333")
# Output: ['1', '22', '333']

finditer() — Returns iterator of match objects

for m in re.finditer(r"\d+", "12 and 34"):
    print(m.group(), "at", m.start())

split() — Split text by a regex pattern

re.split(r"\W+", "Words, separated_by symbols!")
# Output: ['Words', 'separated_by', 'symbols', '']
# The trailing "!" is a separator, so an empty string follows it.
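
When the text ends with a separator, re.split() keeps an empty string at the end of the result. One possible way to handle that, together with the optional maxsplit argument, is sketched here:

```python
import re

text = "Words, separated_by symbols!"

# A trailing separator leaves an empty string at the end of the result.
raw = re.split(r"\W+", text)
# Drop empty pieces if they are not wanted.
parts = [p for p in raw if p]
# maxsplit limits the number of splits; the rest of the text is kept intact.
first_split = re.split(r"\W+", text, maxsplit=1)

print(raw)          # ['Words', 'separated_by', 'symbols', '']
print(parts)        # ['Words', 'separated_by', 'symbols']
print(first_split)  # ['Words', 'separated_by symbols!']
```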

🧱 4. Grouping and Capturing

Parentheses () allow grouping parts of a regex and extracting submatches.

pattern = r"(\d{3})-(\d{2})"
match = re.search(pattern, "Phone: 123-45")
area, number = match.groups()
print(area, number)  # 123 45

Named Groups

pattern = r"(?P<area>\d{3})-(?P<number>\d{2})"
m = re.search(pattern, "Code: 555-12")
print(m.group("area"), m.group("number"))
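
Named groups can also be read all at once. The sketch below reuses the pattern above and shows groupdict(), access by index, and the span of a single group:

```python
import re

pattern = r"(?P<area>\d{3})-(?P<number>\d{2})"
m = re.search(pattern, "Code: 555-12")

if m:
    fields = m.groupdict()    # all named groups as a dict
    by_index = m.group(1)     # named groups are still numbered
    span = m.span("number")   # (start, end) offsets of one group
    print(fields, by_index, span)
    # {'area': '555', 'number': '12'} 555 (10, 12)
```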

Non-Capturing Groups

re.findall(r"(?:Mr|Mrs|Ms)\.\s[A-Z][a-z]+", "Mr. Smith and Ms. Clark")
# Output: ['Mr. Smith', 'Ms. Clark']
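
The reason the non-capturing form matters with findall() is that capturing groups change what it returns. A short comparison on the same sample text:

```python
import re

text = "Mr. Smith and Ms. Clark"

# One capturing group: findall() returns only the captured part.
titles = re.findall(r"(Mr|Mrs|Ms)\.\s[A-Z][a-z]+", text)
# Non-capturing group: findall() returns the whole match.
full = re.findall(r"(?:Mr|Mrs|Ms)\.\s[A-Z][a-z]+", text)
# Two capturing groups: findall() returns a list of tuples.
pairs = re.findall(r"(Mr|Mrs|Ms)\.\s([A-Z][a-z]+)", text)

print(titles)  # ['Mr', 'Ms']
print(full)    # ['Mr. Smith', 'Ms. Clark']
print(pairs)   # [('Mr', 'Smith'), ('Ms', 'Clark')]
```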

🔄 5. Replacement and Substitution

Use re.sub() to replace text matching a pattern.

text = "Apples are great. I like apple pie."
result = re.sub(r"(?i)apple", "banana", text)
print(result)

Output (the replacement string is inserted exactly as written, so the capital A is not preserved):

bananas are great. I like banana pie.

Using a Function in Replacement

def censor_email(match):
    user, domain = match.group().split("@")
    return f"{user[0]}***@{domain}"

text = "Contact me at hello@example.com"
print(re.sub(r"[\w.-]+@[\w.-]+", censor_email, text))
# Output: Contact me at h***@example.com
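
Besides a function, the replacement string itself can refer to captured groups. The following sketch, with sample values chosen only for illustration, shows numbered references, named references, the count argument, and re.subn():

```python
import re

date = "2025-10-26"

# \1, \2, \3 refer to capturing groups by position.
us_style = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", date)

# \g<name> refers to a named group.
dotted = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>.\g<m>.\g<y>",
    date,
)

# count limits how many matches are replaced.
once = re.sub(r"a", "-", "banana", count=1)
# subn() returns the new string and the number of replacements made.
replaced, n = re.subn(r"a", "-", "banana")

print(us_style, dotted, once, replaced, n)
# 10/26/2025 26.10.2025 b-nana b-n-n- 3
```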

⚙️ 6. Flags (Modifiers)

Regex flags modify how the pattern behaves.

Flag                  | Description                          | Usage
----------------------+--------------------------------------+--------------------------------------
re.IGNORECASE or re.I | Case-insensitive matching            | re.search(r"python", "PYTHON", re.I)
re.MULTILINE or re.M  | ^ and $ match start/end of each line | re.search("^abc", text, re.M)
re.DOTALL or re.S     | . matches newline too                | re.search("a.*b", "a\n\nb", re.S)
re.VERBOSE or re.X    | Allow multi-line, commented regex    | Useful for readability

Example with Multiple Flags

pattern = re.compile(r"python.*rocks", re.I | re.S)
pattern.search("PYTHON\nreally ROCKS!")
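
To see what re.MULTILINE changes in practice, the sketch below runs the same pattern with and without the flag on a made-up three-line string:

```python
import re

text = "first line\nabc here\nlast abc"

# Without re.M, ^ only matches at the very start of the string.
plain = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after each newline.
per_line = re.findall(r"^\w+", text, re.M)
# Flags can be passed by keyword and combined with |.
ends = re.findall(r"ABC$", text, flags=re.I | re.M)

print(plain)     # ['first']
print(per_line)  # ['first', 'abc', 'last']
print(ends)      # ['abc']
```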

🧮 7. Compiling Regular Expressions

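Compiling a pattern with re.compile() returns a reusable pattern object. Because the re module already caches recently used patterns, the speed gain is usually small; the main benefits are readability and avoiding repeated cache lookups in hot loops.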

phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
print(phone_pattern.findall("Call 123-456-7890 or 987-654-3210"))
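
A compiled pattern object offers the same operations as the module-level functions. As a rough sketch, with sample lines invented for the example:

```python
import re

phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# Sample input lines (illustrative only).
lines = ["Call 123-456-7890", "no number here", "Fax: 987-654-3210 (office)"]

# Reuse the same compiled object for every line.
hits = []
for line in lines:
    m = phone_pattern.search(line)
    if m:
        hits.append(m.group())

masked = phone_pattern.sub("XXX-XXX-XXXX", lines[0])

print(hits)                   # ['123-456-7890', '987-654-3210']
print(masked)                 # Call XXX-XXX-XXXX
print(phone_pattern.pattern)  # the original pattern string
```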

🧰 8. Real-World Use Cases

1️⃣ Email Extraction

emails = re.findall(r"[\w.-]+@[\w.-]+", "Contact support@example.com, admin@test.org")

2️⃣ Phone Validation

phone_numbers = ["123-456-7890", "555-1234", "800-555-5555"]
valid = [p for p in phone_numbers if re.fullmatch(r"\d{3}-\d{3}-\d{4}", p)]

3️⃣ Log Parsing

log = "2025-10-26 ERROR: Connection failed at 14:31:22"
parts = re.search(r"(\d{4}-\d{2}-\d{2})\s(\w+):\s(.+)", log)
print(parts.groups())
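
For more than one line, the same idea could be extended with named groups and a None check. This is only a sketch: the extra log lines and the field names date, level, and message are assumptions, not part of any real log format.

```python
import re

# Hypothetical log lines in the same format as the example above.
logs = [
    "2025-10-26 ERROR: Connection failed at 14:31:22",
    "2025-10-26 INFO: Retry scheduled",
    "malformed line",
]

line_re = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<level>\w+):\s(?P<message>.+)"
)

records = []
for line in logs:
    m = line_re.match(line)
    if m is None:  # skip lines that do not fit the format
        continue
    records.append(m.groupdict())

errors = [r["message"] for r in records if r["level"] == "ERROR"]
print(len(records), errors)
```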

4️⃣ HTML Tag Removal

clean = re.sub(r"<.*?>", "", "<p>Hello <b>world</b></p>")
# Output: Hello world

5️⃣ Data Sanitization

text = "Credit Card: 4111-1111-1111-1111"
safe = re.sub(r"\d{4}-\d{4}-\d{4}-\d{4}", "****-****-****-****", text)
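
A common variation keeps the last four digits visible. One way to sketch it, assuming the same dash-separated card format, is to capture only the final block:

```python
import re

text = "Credit Card: 4111-1111-1111-1111"

# (?:\d{4}-){3} matches the first three blocks without capturing them;
# (\d{4}) captures the last block so \1 can put it back.
safe = re.sub(r"\b(?:\d{4}-){3}(\d{4})\b", r"****-****-****-\1", text)

print(safe)  # Credit Card: ****-****-****-1111
```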

🧭 9. Debugging and Testing Regex

Use an online regex tester to try patterns interactively, or debug inside Python:

import re
# re.DEBUG is a flag, not a function; it prints the parsed pattern structure.
re.compile(r"\d+", re.DEBUG)

Also use re.VERBOSE to write multi-line, well-commented regex for readability.
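
As an illustration of re.VERBOSE, the phone pattern could be written as follows. The group names area, prefix, and line are chosen for this sketch only. In verbose mode, whitespace in the pattern is ignored and # starts a comment, so a literal space has to be escaped or placed in a character class.

```python
import re

phone = re.compile(
    r"""
    (?P<area>\d{3})     # three-digit area code
    -                   # literal separator
    (?P<prefix>\d{3})   # three-digit prefix
    -
    (?P<line>\d{4})     # four-digit line number
    """,
    re.VERBOSE,
)

m = phone.search("Call 123-456-7890")
parts = m.groupdict() if m else {}
print(parts)  # {'area': '123', 'prefix': '456', 'line': '7890'}
```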


📊 10. UML-Style Regex Flow

+----------------------------+
|  Input String              |
+----------------------------+
              |
              v
+----------------------------+
|  Regex Pattern             |
|  (e.g. \d{3}-\d{2})        |
+----------------------------+
              |
              v
+----------------------------+
|  re.search() / findall()   |
+----------------------------+
              |
              v
+----------------------------+
|  Match Object              |
|  .group(), .groups()       |
+----------------------------+
              |
              v
+----------------------------+
|  Substitution / Split      |
+----------------------------+

🧩 11. Regex Cheat Sheet

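Expression    | Meaning
--------------+--------------------------------
\d            | Digit (0–9)
\D            | Non-digit
\w            | Word char (A–Z, a–z, 0–9, _)
\W            | Non-word char
\s            | Whitespace
\S            | Non-whitespace
\b            | Word boundary
^...$         | Start and end of string
(...)         | Capturing group
(?:...)       | Non-capturing group
(?P<name>...) | Named capturing group
(?=...)       | Lookahead
(?!...)       | Negative lookahead

Note: with str patterns, \d, \w, and \s also match non-ASCII Unicode characters unless the re.ASCII flag is set.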
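
Lookaheads and word boundaries are easier to grasp with a concrete case. In this sketch the sample strings are invented; the point is that a lookahead checks what follows without including it in the match.

```python
import re

text = "10px 20em 30px"

# Positive lookahead: digits only if followed by "px"; "px" is not consumed.
sizes = re.findall(r"\d+(?=px)", text)

# Negative lookahead: digits not followed by "px". The extra \d alternative
# stops the engine from backtracking to a shorter run such as "1" in "10px".
others = re.findall(r"\d+(?!\d|px)", text)

# Word boundary: match "cat" only as a whole word.
words = re.findall(r"\bcat\b", "cat concat cat.")

print(sizes)   # ['10', '30']
print(others)  # ['20']
print(words)   # ['cat', 'cat']
```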

🧾 12. Best Practices

✅ Always use raw strings for regex patterns: r"\d+"
✅ Precompile complex patterns with re.compile() for performance.
✅ Avoid overly greedy patterns — use *?, +? when needed.
✅ Combine re.VERBOSE with comments for maintainability.
✅ Test patterns thoroughly — regex can silently overmatch.
✅ For structured data (JSON, XML, CSV), prefer parsers over regex.
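
The greedy versus lazy point can be shown on a short HTML fragment. This is a sketch for illustration only; for real HTML a parser remains the safer choice.

```python
import re

html = "<p>Hello <b>world</b></p>"

# Greedy: .* runs to the last ">" in the string, giving one long match.
greedy = re.findall(r"<.*>", html)
# Lazy: .*? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)
# A negated character class gives the same result without backtracking.
negated = re.findall(r"<[^>]*>", html)

print(greedy)   # ['<p>Hello <b>world</b></p>']
print(lazy)     # ['<p>', '<b>', '</b>', '</p>']
print(negated)  # ['<p>', '<b>', '</b>', '</p>']
```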


🧠 Summary


By mastering regex, you unlock one of the most potent tools in text processing — turning raw data into structured insight.