Chapter 6: Python Standard Library
Sub-Chapter: Regular Expressions — Pattern Matching and Text Manipulation
Regular expressions (regex) are one of the most powerful tools for text analysis, validation, and transformation.
They allow you to search, match, and extract complex patterns from text with concise syntax.
Python’s re module provides comprehensive support for regular expressions, combining flexibility with performance.
🧱 1. Importing the re Module
import re
This module provides functions such as search(), match(), findall(), split(), and sub() for working with regex patterns.
🔍 2. Basic Pattern Matching
Example
pattern = r"\d+" # Matches one or more digits
match = re.search(pattern, "The price is $100")
print(match.group()) # Output: 100
Common Metacharacters
| Symbol | Description | Example | Matches |
|---|---|---|---|
| . | Any character except newline | a.b | a1b, acb |
| ^ | Start of string | ^Hello | Hello world |
| $ | End of string | world$ | Hello world |
| * | 0 or more repetitions | bo* | b, bo, boo |
| + | 1 or more repetitions | go+ | go, goo |
| ? | 0 or 1 repetition | colou?r | color, colour |
| [] | Character set | [A-Z] | Any capital letter |
| {n,m} | Range of repetitions | \d{2,4} | 99, 2024 |
| \ | Escape special char | \. | A literal dot |
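To make the table concrete, here is a small sketch that runs several of these patterns through re.findall(). The sample strings are made up for illustration. Note how \d{2,4} also matches inside a longer run of digits, which is easy to overlook.

```python
import re

# Each pair is (pattern, sample text); the sample texts are illustrative only.
samples = [
    (r"a.b", "a1b acb a\nb"),           # . does not match the newline in "a\nb"
    (r"^Hello", "Hello world"),         # anchored at the start of the string
    (r"world$", "Hello world"),         # anchored at the end of the string
    (r"colou?r", "color colour colr"),  # the "u" is optional, the "o" is not
    (r"\d{2,4}", "7 99 2024 123456"),   # 2 to 4 digits, even inside longer runs
    (r"\.", "v1.2"),                    # escaped dot matches a literal "."
]

results = {pattern: re.findall(pattern, text) for pattern, text in samples}
for pattern, found in results.items():
    print(pattern, "->", found)
```

The long number 123456 is reported as two matches, 1234 and 56, because nothing in the pattern requires the digits to stand alone.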
🧩 3. Searching, Matching, and Finding
search() — Finds first match anywhere in the string
re.search(r"\d+", "Order ID: 12345").group()
# Output: '12345'
match() — Matches only at the beginning
re.match(r"\d+", "123abc").group()
# Output: '123'
findall() — Returns all matches
re.findall(r"\d+", "A1 B22 C333")
# Output: ['1', '22', '333']
finditer() — Returns iterator of match objects
for m in re.finditer(r"\d+", "12 and 34"):
    print(m.group(), "at", m.start())
split() — Split text by a regex pattern
re.split(r"\W+", "Words, separated_by symbols!")
# Output: ['Words', 'separated_by', 'symbols', '']
# The trailing '' appears because the final "!" is itself a separator.
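The examples in this section call .group() directly on the result, which only works when a match exists. search(), match(), and fullmatch() all return None when nothing matches, so real code usually checks first. The sketch below shows one way to do that. The helper name first_number is hypothetical, chosen for illustration.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    return m.group() if m else None

text = "abc123"
at_start = re.match(r"\d+", text)     # None: the digits are not at position 0
anywhere = re.search(r"\d+", text)    # match object for '123'
whole = re.fullmatch(r"\d+", text)    # None: the whole string is not digits

print(at_start, anywhere.group(), whole)
print(first_number("Order ID: 12345"), first_number("no digits here"))
```

Calling .group() on a None result raises AttributeError, so the guard in first_number is the safer default whenever the input is not under your control.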
🧱 4. Grouping and Capturing
Parentheses () allow grouping parts of a regex and extracting submatches.
pattern = r"(\d{3})-(\d{2})"
match = re.search(pattern, "Phone: 123-45")
area, number = match.groups()
print(area, number) # 123 45
Named Groups
pattern = r"(?P<area>\d{3})-(?P<number>\d{2})"
m = re.search(pattern, "Code: 555-12")
print(m.group("area"), m.group("number"))
Non-Capturing Groups
re.findall(r"(?:Mr|Mrs|Ms)\.\s[A-Z][a-z]+", "Mr. Smith and Ms. Clark")
# Output: ['Mr. Smith', 'Ms. Clark']
🔄 5. Replacement and Substitution
Use re.sub() to replace text matching a pattern.
text = "Apples are great. I like apple pie."
result = re.sub(r"(?i)apple", "banana", text)
print(result)
Output:
bananas are great. I like banana pie.
(The replacement string is inserted as written, so the capital "A" of "Apples" is not preserved.)
Using a Function in Replacement
def censor_email(match):
    # Keep the first character of the user part and mask the rest
    user, domain = match.group().split("@")
    return f"{user[0]}***@{domain}"

text = "Contact me at hello@example.com"
print(re.sub(r"[\w.-]+@[\w.-]+", censor_email, text))
# Output: Contact me at h***@example.com
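Replacement strings can also refer back to captured groups, which ties this section to the grouping syntax from section 4. The following is a minimal sketch that reorders ISO-style dates. The sample text and the name date_pattern are illustrative assumptions.

```python
import re

date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
text = "Released 2025-10-26, patched 2025-11-02."

# \g<name> refers to a named group; \1, \2, \3 refer to groups by position.
by_name = date_pattern.sub(r"\g<day>/\g<month>/\g<year>", text)
by_number = date_pattern.sub(r"\3/\2/\1", text)

print(by_name)
print(by_number)
```

Both calls produce the same string. The named form tends to stay readable when the pattern gains or loses groups later.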
⚙️ 6. Flags (Modifiers)
Regex flags modify how the pattern behaves.
| Flag | Description | Usage |
|---|---|---|
| re.IGNORECASE or re.I | Case-insensitive matching | re.search(r"python", "PYTHON", re.I) |
| re.MULTILINE or re.M | ^ and $ match start/end of each line | re.search("^abc", text, re.M) |
| re.DOTALL or re.S | . matches newline too | re.search("a.*b", "a\n\nb", re.S) |
| re.VERBOSE or re.X | Allow multi-line, commented regex | Useful for readability |
Example with Multiple Flags
pattern = re.compile(r"python.*rocks", re.I | re.S)
pattern.search("PYTHON\nreally ROCKS!")
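The effect of re.MULTILINE is easiest to see side by side. This sketch runs the same anchored pattern with and without the flag. The log text is invented for the example.

```python
import re

log_text = "INFO start\nERROR disk full\nINFO retry\nERROR timeout"

# Without re.M, ^ only matches at the very start of the whole string.
without_flag = re.findall(r"^ERROR (.+)$", log_text)

# With re.M, ^ and $ also match at the start and end of every line.
with_flag = re.findall(r"^ERROR (.+)$", log_text, re.M)

print(without_flag)
print(with_flag)
```

Because the string starts with INFO, the first call finds nothing, while the second returns the message from each ERROR line.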
🧮 7. Compiling Regular Expressions
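The re module already caches recently used patterns, so compiling is not required for correctness. Calling re.compile() once still helps when a pattern is reused many times, for example inside a loop: it skips the cache lookup on every call and gives the pattern a descriptive name.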
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
print(phone_pattern.findall("Call 123-456-7890 or 987-654-3210"))
🧰 8. Real-World Use Cases
1️⃣ Email Extraction
emails = re.findall(r"[\w.-]+@[\w.-]+", "Contact support@example.com, admin@test.org")
2️⃣ Phone Validation
phone_numbers = ["123-456-7890", "555-1234", "800-555-5555"]
valid = [p for p in phone_numbers if re.fullmatch(r"\d{3}-\d{3}-\d{4}", p)]
3️⃣ Log Parsing
log = "2025-10-26 ERROR: Connection failed at 14:31:22"
parts = re.search(r"(\d{4}-\d{2}-\d{2})\s(\w+):\s(.+)", log)
print(parts.groups())
4️⃣ HTML Tag Removal
clean = re.sub(r"<.*?>", "", "<p>Hello <b>world</b></p>")
# Output: Hello world
5️⃣ Data Sanitization
text = "Credit Card: 4111-1111-1111-1111"
safe = re.sub(r"\d{4}-\d{4}-\d{4}-\d{4}", "****-****-****-****", text)
🧭 9. Debugging and Testing Regex
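Use an online regex tester, or debug inside Python by compiling the pattern with the re.DEBUG flag, which prints the parsed structure of the pattern:
import re
re.compile(r"\d+", re.DEBUG)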
Also use re.VERBOSE to write multi-line, well-commented regex for readability.
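As a sketch of what a verbose pattern can look like, here is the phone-number pattern from section 7 spread over several lines. The group names area, exchange, and line are illustrative choices, not part of the original pattern.

```python
import re

phone_pattern = re.compile(
    r"""
    (?P<area>\d{3})      # three-digit area code
    -                    # literal hyphen separator
    (?P<exchange>\d{3})  # three-digit exchange
    -
    (?P<line>\d{4})      # four-digit line number
    """,
    re.VERBOSE,
)

m = phone_pattern.fullmatch("123-456-7890")
print(m.group("area"), m.group("exchange"), m.group("line"))
```

In verbose mode, whitespace in the pattern is ignored and everything after # on a line is a comment, so a literal space or # has to be escaped or placed in a character class.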
📊 10. UML-Style Regex Flow
+----------------------+
| Input String |
+----------------------+
↓
+----------------------+
| Regex Pattern |
| (e.g. \d{3}-\d{2}) |
+----------------------+
↓
+----------------------+
| search() / findall() |
+----------------------+
↓
+----------------------+
| Match Object |
| .group(), .groups() |
+----------------------+
↓
+----------------------+
| Substitution / Split |
+----------------------+
🧩 11. Regex Cheat Sheet
| Expression | Meaning |
|---|---|
| \d | Digit (0–9; also other Unicode digits unless re.ASCII is set) |
| \D | Non-digit |
| \w | Word char (A–Z, a–z, 0–9, _; also Unicode letters and digits unless re.ASCII is set) |
| \W | Non-word char |
| \s | Whitespace |
| \S | Non-whitespace |
| \b | Word boundary |
| ^...$ | Start and end of string |
| (...) | Capturing group |
| (?:...) | Non-capturing group |
| (?P<name>...) | Named capturing group |
| (?=...) | Positive lookahead |
| (?!...) | Negative lookahead |
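Word boundaries and lookaheads are the entries in this sheet that tend to need an example. The sketch below shows one of each, using made-up sample strings.

```python
import re

# \b: match "cat" only as a whole word, not inside other words.
text = "cat concatenate cat5 bobcat cat"
whole_words = re.findall(r"\bcat\b", text)

# (?=...): match digits only when "USD" follows, without consuming "USD".
prices = "apple 30USD, pear 45EUR, fig 12USD"
usd_amounts = re.findall(r"\d+(?=USD)", prices)

# (?!...): match file names whose extension is not exactly "py".
files = "a.py b.txt c.py"
non_python = re.findall(r"\w+\.(?!py\b)\w+", files)

print(whole_words, usd_amounts, non_python)
```

Lookaheads check what comes next without including it in the match, which is why the USD amounts come back as bare numbers.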
🧾 12. Best Practices
✅ Always use raw strings for regex patterns: r"\d+"
✅ Precompile complex patterns with re.compile() for performance.
✅ Avoid overly greedy patterns — use *?, +? when needed.
✅ Combine re.VERBOSE with comments for maintainability.
✅ Test patterns thoroughly — regex can silently overmatch.
✅ For structured data (JSON, XML, CSV), prefer parsers over regex.
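The advice about greedy patterns can be illustrated with a short comparison. This is a minimal sketch on an invented HTML fragment. As the last point above says, a real parser is the better choice for actual HTML.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```

The greedy version silently overmatches, which is the kind of mistake that thorough testing is meant to catch.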
🧠 Summary
- Regular expressions let you search, extract, and replace text patterns efficiently.
- Python’s re module supports flags, groups, and substitutions.
- Used carefully, regex provides unmatched power in data cleaning and text analysis.
- Always balance complexity with readability — a clear regex is a maintainable one.
By mastering regex, you unlock one of the most potent tools in text processing — turning raw data into structured insight.