Chapter 6: Python Standard Library

Sub-Chapter: Regular Expressions — Pattern Matching and Text Manipulation

Regular expressions (regex) are one of the most powerful tools for text analysis, validation, and transformation.
They allow you to search, match, and extract complex patterns from text with concise syntax.
Python’s re module provides comprehensive support for regular expressions, combining flexibility with performance.

🧱 1. Importing the `re` Module

import re

This module provides functions such as search(), match(), findall(), split(), and sub() for working with regex patterns.

🔍 2. Basic Pattern Matching

Example

pattern = r"\d+"  # Matches one or more digits
match = re.search(pattern, "The price is $100")
print(match.group())  # Output: 100

Common Metacharacters

Symbol	Description	Example	Matches
`.`	Any character except newline	`a.b`	`a1b`, `acb`
`^`	Start of string	`^Hello`	`Hello world`
`$`	End of string	`world$`	`Hello world`
`*`	0 or more repetitions	`bo*`	`b`, `bo`, `boo`
`+`	1 or more repetitions	`go+`	`go`, `goo`
`?`	0 or 1 repetition	`colou?r`	`color`, `colour`
`[]`	Character set	`[A-Z]`	Any capital letter
`{n,m}`	Range of repetitions	`\d{2,4}`	`99`, `2024`
`\`	Escape special char	`\.`	A literal dot
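
The table can be checked directly with findall(). The snippet below is a minimal sketch; the sample strings and the `samples` and `results` names are invented for illustration.

```python
import re

# Each pair is (pattern, sample text); the samples are made up for this example.
samples = [
    (r"a.b", "a1b acb ab"),        # '.' needs exactly one character between a and b
    (r"colou?r", "color colour"),  # '?' makes the 'u' optional
    (r"go+", "g go goo"),          # '+' needs at least one 'o'
    (r"\d{2,4}", "9 99 2024"),     # between two and four digits
]

# Map each pattern to the list of substrings it matched.
results = {pattern: re.findall(pattern, text) for pattern, text in samples}

for pattern, found in results.items():
    print(pattern, "->", found)
```

Note that `ab`, `g`, and `9` are skipped: each is one character short of what its pattern requires.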

🧩 3. Searching, Matching, and Finding

`search()` — Finds first match anywhere in the string

re.search(r"\d+", "Order ID: 12345").group()
# Output: '12345'

`match()` — Matches only at the beginning

re.match(r"\d+", "123abc").group()
# Output: '123'
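
Both search() and match() return None when nothing matches, so calling .group() directly on the result raises AttributeError. A small sketch of the difference and of a defensive wrapper follows; `first_number` is a hypothetical helper name.

```python
import re

text = "abc123"

# match() only looks at the start of the string, so this is None.
at_start = re.match(r"\d+", text)

# search() scans the whole string and finds '123'.
anywhere = re.search(r"\d+", text)

def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    return m.group() if m else None

print(at_start)
print(anywhere.group())
print(first_number("no digits here"))
```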

`findall()` — Returns all matches

re.findall(r"\d+", "A1 B22 C333")
# Output: ['1', '22', '333']

`finditer()` — Returns iterator of match objects

for m in re.finditer(r"\d+", "12 and 34"):
    print(m.group(), "at", m.start())
# Output:
# 12 at 0
# 34 at 7

`split()` — Split text by a regex pattern

re.split(r"\W+", "Words, separated_by symbols!")
# Output: ['Words', 'separated_by', 'symbols', '']
# The trailing '!' is a separator too, so split() returns an empty string after it.
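
A few behaviours of split() are easy to miss: separators at the edges produce empty strings, maxsplit limits the number of cuts, and a capturing group keeps the separators in the result. The sketch below uses a made-up sample string.

```python
import re

text = "one, two;three!"  # made-up sample

# A separator at the end of the text leaves a trailing empty string.
parts = re.split(r"\W+", text)

# Filter out empty strings when only the words matter.
words = [p for p in parts if p]

# maxsplit=1 cuts once and leaves the rest untouched.
head_tail = re.split(r"\W+", text, maxsplit=1)

# A capturing group in the pattern keeps the separators in the output.
with_seps = re.split(r"(\W+)", "a, b")

print(parts, words, head_tail, with_seps, sep="\n")
```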

🧱 4. Grouping and Capturing

Parentheses () allow grouping parts of a regex and extracting submatches.

pattern = r"(\d{3})-(\d{2})"
match = re.search(pattern, "Phone: 123-45")
area, number = match.groups()
print(area, number)  # 123 45

Named Groups

pattern = r"(?P<area>\d{3})-(?P<number>\d{2})"
m = re.search(pattern, "Code: 555-12")
print(m.group("area"), m.group("number"))  # Output: 555 12
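
Named groups can also be read all at once with groupdict(), and each name is still reachable by its position. A short sketch using the same pattern:

```python
import re

m = re.search(r"(?P<area>\d{3})-(?P<number>\d{2})", "Code: 555-12")

# All named groups as a dictionary.
fields = m.groupdict()

# Names and positions refer to the same groups.
same_group = m.group(1) == m.group("area")

# span() accepts a group name and returns (start, end) indexes.
number_span = m.span("number")

print(fields, same_group, number_span)
```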

Non-Capturing Groups

re.findall(r"(?:Mr|Mrs|Ms)\.\s[A-Z][a-z]+", "Mr. Smith and Ms. Clark")
# Output: ['Mr. Smith', 'Ms. Clark']
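
The choice of group type matters most with findall(): when the pattern contains capturing groups, findall() returns the groups rather than the whole match. The comparison below is a sketch on the same sentence; the variable names are illustrative.

```python
import re

text = "Mr. Smith and Ms. Clark"

# One capturing group: findall() returns only what the group captured.
captured = re.findall(r"(Mr|Mrs|Ms)\.\s[A-Z][a-z]+", text)

# Non-capturing group: findall() returns the whole match.
whole = re.findall(r"(?:Mr|Mrs|Ms)\.\s[A-Z][a-z]+", text)

# Two capturing groups: findall() returns one tuple per match.
pairs = re.findall(r"(Mr|Mrs|Ms)\.\s([A-Z][a-z]+)", text)

print(captured, whole, pairs, sep="\n")
```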

🔄 5. Replacement and Substitution

Use re.sub() to replace text matching a pattern.

text = "Apples are great. I like apple pie."
result = re.sub(r"(?i)apple", "banana", text)
print(result)

Output:

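bananas are great. I like banana pie.

The (?i) inline flag makes the match case-insensitive, but the replacement string is inserted literally, so the capital "A" of "Apples" is not carried over.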

Using a Function in Replacement

def censor_email(match):
    user, domain = match.group().split("@")
    return f"{user[0]}***@{domain}"

text = "Contact me at hello@example.com"
print(re.sub(r"[\w.-]+@[\w.-]+", censor_email, text))
# Output: Contact me at h***@example.com
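
The replacement string can also refer back to captured groups, by number with \1 or by name with \g<name>. The sketch below reorders a date; the sample text and group names are chosen for illustration.

```python
import re

iso = "Released on 2025-10-26"

# Numbered backreferences: \1 is the year, \2 the month, \3 the day.
by_number = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", iso)

# Named backreferences read more clearly in longer patterns.
by_name = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    iso,
)

print(by_number)
print(by_name)
```

Writing the replacement as a raw string keeps Python from interpreting \1 as an escape sequence.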

⚙️ 6. Flags (Modifiers)

Regex flags modify how the pattern behaves.

Flag	Description	Usage
`re.IGNORECASE` or `re.I`	Case-insensitive matching	`re.search(r"python", "PYTHON", re.I)`
`re.MULTILINE` or `re.M`	`^` and `$` match start/end of each line	`re.search("^abc", text, re.M)`
`re.DOTALL` or `re.S`	`.` matches newline too	`re.search("a.*b", "a\n\nb", re.S)`
`re.VERBOSE` or `re.X`	Allow multi-line, commented regex	Useful for readability

Example with Multiple Flags

pattern = re.compile(r"python.*rocks", re.I | re.S)
pattern.search("PYTHON\nreally ROCKS!")
# Matches 'PYTHON\nreally ROCKS': re.I ignores case, re.S lets . cross the newline
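
The effect of a flag is easiest to see by running the same pattern with and without it. A minimal sketch, using a made-up three-line string:

```python
import re

text = "alpha\nbeta\ngamma"

# Without re.M, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", text)

# With re.M, ^ matches at the start of every line.
every_line = re.findall(r"^\w+", text, re.M)

# Without re.S, '.' stops at a newline, so there is no match.
no_dotall = re.search(r"a.*b", "a\n\nb")

# With re.S, '.' also matches newlines.
with_dotall = re.search(r"a.*b", "a\n\nb", re.S)

print(first_only, every_line, no_dotall, with_dotall.group() == "a\n\nb")
```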

🧮 7. Compiling Regular Expressions

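Compiling a pattern with re.compile() gives you a reusable pattern object. The re module already caches recently used patterns, so the speed gain is usually modest; the main benefits are skipping the cache lookup in hot loops and keeping the pattern named in one place.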

phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
print(phone_pattern.findall("Call 123-456-7890 or 987-654-3210"))
# Output: ['123-456-7890', '987-654-3210']
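
A compiled pattern exposes the same operations as methods (search, match, findall, sub, and so on), which suits loops over many inputs. The sketch below assumes a hypothetical list of input lines.

```python
import re

# Compile once, then reuse the pattern object's own methods.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical input lines, invented for this example.
lines = ["Call 123-456-7890", "No number here", "Fax: 987-654-3210 ext. 5"]

found = []
for line in lines:
    m = phone_pattern.search(line)
    if m:  # search() returns None for lines without a phone number
        found.append(m.group())

print(found)
print(phone_pattern.pattern)  # the original pattern string is kept on the object
```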

🧰 8. Real-World Use Cases

1️⃣ Email Extraction

emails = re.findall(r"[\w.-]+@[\w.-]+", "Contact support@example.com, admin@test.org")
# emails == ['support@example.com', 'admin@test.org']

2️⃣ Phone Validation

phone_numbers = ["123-456-7890", "555-1234", "800-555-5555"]
valid = [p for p in phone_numbers if re.fullmatch(r"\d{3}-\d{3}-\d{4}", p)]
# valid == ['123-456-7890', '800-555-5555']
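
fullmatch() is the right call for validation because match() and search() accept a string that merely contains a valid number. A short sketch with a made-up candidate that has one digit too many:

```python
import re

pattern = r"\d{3}-\d{3}-\d{4}"
candidate = "123-456-78901"  # one digit too many

# match() succeeds because the first 12 characters fit the pattern.
prefix_ok = re.match(pattern, candidate) is not None

# fullmatch() requires the entire string to fit, so it rejects the input.
whole_ok = re.fullmatch(pattern, candidate) is not None

print(prefix_ok, whole_ok)
```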

3️⃣ Log Parsing

log = "2025-10-26 ERROR: Connection failed at 14:31:22"
parts = re.search(r"(\d{4}-\d{2}-\d{2})\s(\w+):\s(.+)", log)
print(parts.groups())
# Output: ('2025-10-26', 'ERROR', 'Connection failed at 14:31:22')
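
For more than a handful of lines, named groups and a small helper make the parsed fields easier to work with. The following is a sketch, not a complete log parser; `LOG_LINE`, `parse_line`, and the second sample line are hypothetical.

```python
import re

# Same structure as the pattern above, with names attached to each group.
LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<level>\w+):\s(?P<message>.+)"
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the format."""
    m = LOG_LINE.search(line)
    return m.groupdict() if m else None

lines = [
    "2025-10-26 ERROR: Connection failed at 14:31:22",
    "garbage line without a date",  # invented to show the None case
]
records = [parse_line(line) for line in lines]

for record in records:
    print(record)
```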

4️⃣ HTML Tag Removal

clean = re.sub(r"<.*?>", "", "<p>Hello <b>world</b></p>")
# Output: Hello world

5️⃣ Data Sanitization

text = "Credit Card: 4111-1111-1111-1111"
safe = re.sub(r"\d{4}-\d{4}-\d{4}-\d{4}", "****-****-****-****", text)
# safe == 'Credit Card: ****-****-****-****'

🧭 9. Debugging and Testing Regex

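Test patterns interactively in an online regex tester, or debug inside Python by compiling with the re.DEBUG flag, which prints the parsed structure of the pattern:

import re
re.compile(r"\d+", re.DEBUG)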

Also use re.VERBOSE to write multi-line, well-commented regex for readability.
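
In verbose mode, whitespace inside the pattern is ignored and everything after a # is a comment, so a pattern can be laid out one piece per line. A minimal sketch; the group names are chosen for illustration.

```python
import re

phone = re.compile(
    r"""
    (?P<area>\d{3})     # three-digit area code
    -                   # separator
    (?P<prefix>\d{3})   # three-digit prefix
    -                   # separator
    (?P<line>\d{4})     # four-digit line number
    """,
    re.VERBOSE,
)

m = phone.fullmatch("123-456-7890")
print(m.groupdict())
```

To match a literal space or # in verbose mode, escape it or put it inside a character class.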

📊 10. UML-Style Regex Flow

+--------------------------+
|  Input String            |
+--------------------------+
             ↓
+--------------------------+
|  Regex Pattern           |
|  (e.g. \d{3}-\d{2})      |
+--------------------------+
             ↓
+--------------------------+
|  re.search() / findall() |
+--------------------------+
             ↓
+--------------------------+
|  Match Object            |
|  .group(), .groups()     |
+--------------------------+
             ↓
+--------------------------+
|  Substitution / Split    |
+--------------------------+

🧩 11. Regex Cheat Sheet

Expression	Meaning
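`\d`	Digit (0–9; any Unicode decimal digit unless re.ASCII is set)
`\D`	Non-digit
`\w`	Word char (A–Z, a–z, 0–9, _; also Unicode letters and digits unless re.ASCII is set)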
`\W`	Non-word char
`\s`	Whitespace
`\S`	Non-whitespace
`\b`	Word boundary
`^...$`	Start and end of string
`(...)`	Capturing group
`(?:...)`	Non-capturing group
`(?P<name>...)`	Named capturing group
`(?=...)`	Positive lookahead
`(?!...)`	Negative lookahead
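
Word boundaries and lookaheads are zero-width: they check the surrounding text without consuming it. The sketch below shows each on made-up sample strings.

```python
import re

# \b keeps 'cat' from matching inside 'concat' or 'cats'.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

prices = "10 USD, 20 EUR, 30 USD"

# Lookahead: digits that are followed by ' USD'; the currency is not part of the match.
usd = re.findall(r"\d+(?= USD)", prices)

# Negative lookahead: whole numbers that are NOT followed by ' USD'.
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)

print(whole_words, usd, not_usd, sep="\n")
```

The \b on both sides of \d+ in the last pattern stops the engine from backtracking to a partial number such as the "1" in "10".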

🧾 12. Best Practices

✅ Always use raw strings for regex patterns: r"\d+"
✅ Precompile patterns you reuse with re.compile(); the gain is mostly readability plus a small speedup in hot loops.
✅ Avoid overly greedy patterns — use *?, +? when needed.
✅ Combine re.VERBOSE with comments for maintainability.
✅ Test patterns thoroughly — regex can silently overmatch.
✅ For structured data (JSON, XML, CSV), prefer parsers over regex.
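
The greedy versus lazy advice is easiest to see on a short tagged string. This is a small sketch on a made-up input; as the last point says, a real HTML parser is the safer choice for actual markup.

```python
import re

html = "<b>bold</b>"

# Greedy: .* runs to the last '>' in the string, swallowing both tags.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first '>' it can.
lazy = re.findall(r"<.*?>", html)

# A negated character class gives the same result without backtracking.
negated = re.findall(r"<[^>]*>", html)

print(greedy, lazy, negated, sep="\n")
```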

🧠 Summary

Regular expressions let you search, extract, and replace text patterns efficiently.
Python’s re module supports flags, groups, and substitutions.
Used carefully, regex provides unmatched power in data cleaning and text analysis.
Always balance complexity with readability — a clear regex is a maintainable one.

By mastering regex, you unlock one of the most potent tools in text processing — turning raw data into structured insight.