Introduction (Understanding the Basics)

What is a Regular Expression (Regex)?

In simple words: Regex is a powerful pattern-matching language that helps you find, check, or modify specific patterns in text.

Real-life example:

You have 1000 emails. You want to extract all phone numbers from them.

  • Without regex: Roughly 2 hours (checking line by line)
  • With regex: Roughly 2 seconds (one pattern, \d{3}-\d{3}-\d{4}, matches every number written like 123-456-7890)
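
As a rough illustration of the "with regex" case, the sketch below pulls phone numbers out of a few message bodies. The `emails` list and its contents are made up for this example, and the pattern only covers the 123-456-7890 format.

```python
import re

# Hypothetical message bodies standing in for the 1000 emails
emails = [
    "Hi, call me at 555-123-4567 tomorrow.",
    "No phone number in this one.",
    "Office: 555-987-6543, mobile: 555-000-1111.",
]

# Compile once, reuse for every message
phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

phone_numbers = []
for body in emails:
    phone_numbers.extend(phone_pattern.findall(body))

print(phone_numbers)
# ['555-123-4567', '555-987-6543', '555-000-1111']
```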

Why is Text Cleaning Necessary?

Raw text is messy. For example:

Raw Tweet:

"RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome"

Cleaned Text:

"i love this product"

Why clean?

A machine learning model will not treat "LOOOOOOVE" and "love" as the same word unless you clean the text first.

Real-World Use Cases

What Will You Learn?

  1. Basic regex patterns (digits, words, spaces, quantifiers)
  2. Using Python's re module (findall, sub, match, search)
  3. Cleaning real tweets, logs, and emails
  4. Building your own text cleaning pipeline

Practical Lab (Step-by-Step)

Step 1: Regex Building Blocks (Reference Table)
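
A minimal sketch of the most common building blocks in action. The selection below (digits, word characters, whitespace, quantifiers, character classes, anchors) is an assumption about what a beginner reference usually covers, and the sample strings are invented.

```python
import re

sample = "Order 42: 3 apples, 10 pears (ref A7)"

digits = re.findall(r'\d', sample)         # \d    = one digit
numbers = re.findall(r'\d+', sample)       # +     = one or more
two_digit = re.findall(r'\d{2}', sample)   # {2}   = exactly two
words = re.findall(r'\w+', sample)         # \w    = letter, digit, or underscore
parts = re.split(r'\s+', "a  b \t c")      # \s    = space, tab, or newline
spellings = re.findall(r'colou?r', "color colour")  # ?  = optional character
vowels = re.findall(r'[aeiou]', "regex")   # [...] = any one listed character
starts = bool(re.search(r'^Order', sample))  # ^   = start of string
ends = bool(re.search(r'\)$', sample))       # $   = end of string

print(digits)     # ['4', '2', '3', '1', '0', '7']
print(numbers)    # ['42', '3', '10', '7']
print(two_digit)  # ['42', '10']
print(words)      # ['Order', '42', '3', 'apples', '10', 'pears', 'ref', 'A7']
print(parts)      # ['a', 'b', 'c']
print(spellings)  # ['color', 'colour']
print(vowels)     # ['e', 'e']
print(starts, ends)  # True True
```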


Step 2: How to Use Regex in Python (4 Main Functions)

import re

text = "My email is john@gmail.com and my phone is 123-456-7890"

# 1. findall() - Find all matches and return as list
emails = re.findall(r'\w+@\w+\.\w+', text)
print(emails)  # ['john@gmail.com']

# 2. search() - Find first match and return match object
match = re.search(r'\d{3}-\d{3}-\d{4}', text)
print(match.group())  # '123-456-7890'

# 3. sub() - Substitute/replace patterns
cleaned = re.sub(r'\d', 'X', text)  # Replace all digits with X
print(cleaned)  # "My email is john@gmail.com and my phone is XXX-XXX-XXXX"

# 4. match() - Check if string starts with pattern
result = re.match(r'My', text)
print(bool(result))  # True (because string starts with "My")
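
Two details often trip up beginners: match() only looks at the start of the string, and findall() changes what it returns when the pattern contains a capture group. The short sketch below shows both, along with re.fullmatch(), which requires the whole string to fit the pattern. The sample text is invented for illustration.

```python
import re

text = "My phone is 123-456-7890"
pattern = r'\d{3}-\d{3}-\d{4}'

at_start = re.match(pattern, text)                # None: text does not start with digits
anywhere = re.search(pattern, text)               # finds the number inside the text
whole_ok = re.fullmatch(pattern, "123-456-7890")  # whole string fits the pattern
whole_bad = re.fullmatch(pattern, text)           # None: extra words around the number

print(at_start)          # None
print(anywhere.group())  # 123-456-7890
print(bool(whole_ok))    # True
print(whole_bad)         # None

# With a capture group, findall() returns the group, not the full match
with_group = re.findall(r'(\d{3})-\d{3}-\d{4}', text)
# A non-capturing group (?:...) keeps the full match
without_group = re.findall(r'(?:\d{3})-\d{3}-\d{4}', text)

print(with_group)     # ['123']
print(without_group)  # ['123-456-7890']
```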


Step 3: Real Example - Cleaning a Twitter Tweet

Problem: Remove @mentions, hashtags, URLs, and extra punctuation from a tweet.

import re

tweet = "RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome"

print("Original:", tweet)
print()

# Step 1: Remove mentions (@username)
step1 = re.sub(r'@\w+', '', tweet)
print("Step 1 - Remove mentions:", step1)

# Step 2: Remove URLs
step2 = re.sub(r'https?://\S+', '', step1)
print("Step 2 - Remove URLs:", step2)

# Step 3: Remove hashtags
step3 = re.sub(r'#\w+', '', step2)
print("Step 3 - Remove hashtags:", step3)

# Step 4: Remove punctuation and emoji (keep only word characters and whitespace)
step4 = re.sub(r'[^\w\s]', '', step3)
print("Step 4 - Remove punctuation:", step4)

# Step 5: Convert to lowercase
step5 = step4.lower()
print("Step 5 - Convert to lowercase:", step5)

# Step 6: Convert multiple spaces to single space
final = re.sub(r'\s+', ' ', step5).strip()
print("Step 6 - Final cleaned text:", final)

Output:

Original: RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome

Step 1 - Remove mentions: RT : I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome
Step 2 - Remove URLs: RT : I LOOOOOOVE this product!!!!! 🚀🚀  #awesome
Step 3 - Remove hashtags: RT : I LOOOOOOVE this product!!!!! 🚀🚀  
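Step 4 - Remove punctuation: RT  I LOOOOOOVE this product   
Step 5 - Convert to lowercase: rt  i loooooove this product   
Step 6 - Final cleaned text: rt i loooooove this product

Note: [^\w\s] also removes the emoji, because emoji are neither word characters nor whitespace. The result still contains "rt" and the stretched "loooooove"; reaching "i love this product" needs two more steps (dropping the retweet marker and collapsing repeated letters).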
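
The six steps above leave the retweet marker and the stretched word in place. A minimal sketch of two extra steps that would bring the tweet down to "i love this product" is shown below. The function name `normalize_tweet` is hypothetical, and collapsing three or more repeated letters to one is only a heuristic: it handles "loooooove" but would also shorten any genuine triple letter.

```python
import re

def normalize_tweet(tweet):
    text = re.sub(r'^RT\s+', '', tweet)        # drop a leading retweet marker
    text = re.sub(r'@\w+:?', '', text)         # mentions, plus the colon after "RT @user:"
    text = re.sub(r'https?://\S+', '', text)   # URLs
    text = re.sub(r'#\w+', '', text)           # hashtags
    text = re.sub(r'[^\w\s]', '', text)        # punctuation and emoji
    text = text.lower()
    # Heuristic: a letter repeated 3+ times in a row becomes a single letter
    text = re.sub(r'([a-z])\1{2,}', r'\1', text)
    return re.sub(r'\s+', ' ', text).strip()

tweet = "RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome"
print(normalize_tweet(tweet))  # i love this product
```

Doubled letters such as the "oo" in "good" are left alone, since the rule only fires on runs of three or more.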

Step 4: Email Extraction (Practical Lab Task)

import re

# Complex example with multiple emails
text = """
Contact us:
- Support: support@company.com
- Sales: sales@company.co.uk
- John: john.doe123@my-email.net
- Invalid: admin@.com
- Another: user.name+tag@gmail.com
- Mary: mary_123@sub.domain.org
"""

# Pattern explanation:
# [\w\.\+-]+  : email username (letters, digits, underscores, dots, plus, hyphens)
# @            : literal @ symbol
# [\w\.-]+     : domain name
# \.           : dot
# \w+          : TLD (com, org, uk, etc.)
pattern = r'[\w\.\+-]+@[\w\.-]+\.\w+'

emails = re.findall(pattern, text)
print("Extracted emails:")
for i, email in enumerate(emails, 1):
    print(f"  {i}. {email}")
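
To see which addresses the pattern accepts, each candidate can be checked on its own with re.fullmatch(). The sketch below reuses the pattern and the six addresses from this step. The extra string "a@b..c" is an invented edge case showing that the pattern is permissive rather than a full email validator.

```python
import re

pattern = r'[\w\.\+-]+@[\w\.-]+\.\w+'

candidates = [
    "support@company.com",
    "sales@company.co.uk",
    "john.doe123@my-email.net",
    "admin@.com",
    "user.name+tag@gmail.com",
    "mary_123@sub.domain.org",
]

accepted = [c for c in candidates if re.fullmatch(pattern, c)]
rejected = [c for c in candidates if c not in accepted]

print(len(accepted))  # 5
print(rejected)       # ['admin@.com']

# The pattern does not reject every malformed address:
double_dot = bool(re.fullmatch(pattern, "a@b..c"))
print(double_dot)     # True (two dots in a row still pass)
```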

Step 5: Phone Number Validation (Real Form Example)

import re

def validate_phone(phone):
    """
    Validate phone numbers in various formats:
    - 123-456-7890
    - (123) 456-7890
    - 123.456.7890
    - 123 456 7890
    - 1234567890
    """
    pattern = r'^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$'
    
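    # fullmatch() requires the whole string to fit; match() with $ would
    # still accept a trailing newline such as "1234567890\n"
    if re.fullmatch(pattern, phone):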
        return True, "Valid"
    else:
        return False, "Invalid"

# Test the function
test_numbers = [
    "123-456-7890",      # Valid
    "(123) 456-7890",    # Valid
    "123.456.7890",      # Valid
    "123 456 7890",      # Valid
    "1234567890",        # Valid (without separator)
    "123-45-6789",       # Invalid
    "12345",             # Invalid
    "abc-def-ghij",      # Invalid
    "1-800-FLOWERS",     # Invalid (letters not allowed)
    "+1-123-456-7890"    # Invalid (country code)
]

print("Phone Number Validation Results:")
print("-" * 45)
for num in test_numbers:
    status, msg = validate_phone(num)
    symbol = "✓" if status else "✗"
    print(f"{symbol} {num:20} → {msg}")
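
The pattern in this step makes each parenthesis optional on its own, so an unbalanced input like "(123-456-7890" passes. A stricter variant, sketched below, uses alternation so the area code is either fully wrapped in parentheses or not wrapped at all. The names `STRICT_PHONE` and `validate_phone_strict` are hypothetical.

```python
import re

# Either "(123)" with an optional space, or "123" with an optional separator
STRICT_PHONE = re.compile(r'(?:\(\d{3}\)\s?|\d{3}[-.\s]?)\d{3}[-.\s]?\d{4}')

def validate_phone_strict(phone):
    # fullmatch() requires the entire string to match, so a trailing
    # newline or any other extra character is rejected
    return STRICT_PHONE.fullmatch(phone) is not None

samples = [
    "(123) 456-7890",   # balanced parentheses
    "123-456-7890",
    "1234567890",
    "(123-456-7890",    # missing closing parenthesis
    "123)456-7890",     # missing opening parenthesis
    "1234567890\n",     # trailing newline
]

for number in samples:
    print(repr(number), validate_phone_strict(number))
```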

Step 6: Date Extraction from Text

import re

def extract_dates(text):
    """
    Extract dates in various formats:
    - YYYY-MM-DD
    - MM/DD/YYYY
    - DD-MM-YYYY
    """
    patterns = {
        'YYYY-MM-DD': r'\d{4}-\d{2}-\d{2}',
        'MM/DD/YYYY': r'\d{2}/\d{2}/\d{4}',
        'DD-MM-YYYY': r'\d{2}-\d{2}-\d{4}'
    }
    
    all_dates = []
    for format_name, pattern in patterns.items():
        matches = re.findall(pattern, text)
        for match in matches:
            all_dates.append((match, format_name))
    
    return all_dates

# Test
sample_text = """
Event dates:
- Conference: 2024-12-15
- Workshop: 12/20/2024
- Deadline: 31-01-2025
- Meeting: 2025-03-10
"""

dates = extract_dates(sample_text)
print("Extracted Dates:")
for date, format_type in dates:
    print(f"  {date} (format: {format_type})")
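
The patterns above only check the shape of a date, so a string like 2024-13-45 would also be extracted. One possible refinement, sketched below, pairs each regex with a datetime.strptime() format and keeps only the candidates that parse as real calendar dates. The names `DATE_FORMATS` and `extract_valid_dates` and the test sentence are hypothetical.

```python
import re
from datetime import datetime

# (regex, strptime format) pairs; \b keeps a match from starting
# or ending in the middle of a longer run of digits
DATE_FORMATS = {
    'YYYY-MM-DD': (r'\b\d{4}-\d{2}-\d{2}\b', '%Y-%m-%d'),
    'MM/DD/YYYY': (r'\b\d{2}/\d{2}/\d{4}\b', '%m/%d/%Y'),
    'DD-MM-YYYY': (r'\b\d{2}-\d{2}-\d{4}\b', '%d-%m-%Y'),
}

def extract_valid_dates(text):
    valid = []
    for name, (pattern, fmt) in DATE_FORMATS.items():
        for candidate in re.findall(pattern, text):
            try:
                datetime.strptime(candidate, fmt)
            except ValueError:
                continue  # right shape, but not a real date
            valid.append((candidate, name))
    return valid

text = "Due 2024-12-15, not 2024-13-45 or 99/99/2024; paid 31-01-2025"
print(extract_valid_dates(text))
# [('2024-12-15', 'YYYY-MM-DD'), ('31-01-2025', 'DD-MM-YYYY')]
```

As in the original function, results are grouped by format rather than listed in the order they appear in the text.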

Step 7: Complete Text Cleaning Pipeline

import re

class TextCleaner:
    """
    A complete text cleaning pipeline
    """
    
    @staticmethod
    def remove_html_tags(text):
        """Remove HTML/XML tags"""
        return re.sub(r'<[^>]+>', '', text)
    
    @staticmethod
    def remove_urls(text):
        """Remove URLs"""
        return re.sub(r'https?://\S+|www\.\S+', '', text)
    
    @staticmethod
    def remove_mentions_hashtags(text):
        """Remove @mentions and #hashtags"""
        text = re.sub(r'@\w+', '', text)
        text = re.sub(r'#\w+', '', text)
        return text
    
    @staticmethod
    def remove_punctuation(text):
        """Remove punctuation (keep letters, numbers, spaces)"""
        return re.sub(r'[^\w\s]', '', text)
    
    @staticmethod
    def to_lowercase(text):
        """Convert to lowercase"""
        return text.lower()
    
    @staticmethod
    def remove_extra_spaces(text):
        """Remove extra whitespace"""
        return re.sub(r'\s+', ' ', text).strip()
    
    @classmethod
    def clean(cls, text, verbose=False):
        """
        Run complete cleaning pipeline
        """
        if verbose:
            print("Original:", text)
        
        text = cls.remove_html_tags(text)
        text = cls.remove_urls(text)
        text = cls.remove_mentions_hashtags(text)
        text = cls.remove_punctuation(text)
        text = cls.to_lowercase(text)
        text = cls.remove_extra_spaces(text)
        
        if verbose:
            print("Cleaned:", text)
        
        return text

# Test the pipeline
sample_text = """
<html>Hello World!</html> 
Check out https://example.com and @john_doe said #awesome!!!
This is a MESSY text!!! with   multiple    spaces.
"""

cleaned = TextCleaner.clean(sample_text, verbose=True)
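
The same pipeline can also be written as a table of (pattern, replacement) pairs that are compiled once and applied in order. This is only an alternative sketch, not a replacement for the class above; `CLEANING_STEPS` and `clean_text` are hypothetical names. One deliberate difference: tags are replaced with a space instead of an empty string, so words in adjacent tags do not get glued together.

```python
import re

# Applied top to bottom; order matters (URLs must go before punctuation)
CLEANING_STEPS = [
    (re.compile(r'<[^>]+>'), ' '),               # HTML/XML tags -> space
    (re.compile(r'https?://\S+|www\.\S+'), ''),  # URLs
    (re.compile(r'[@#]\w+'), ''),                # @mentions and #hashtags
    (re.compile(r'[^\w\s]'), ''),                # punctuation and emoji
]

def clean_text(text):
    for pattern, replacement in CLEANING_STEPS:
        text = pattern.sub(replacement, text)
    # Lowercase, then squeeze all whitespace runs into single spaces
    return re.sub(r'\s+', ' ', text.lower()).strip()

print(clean_text("<p>Hello</p><p>World!</p> Visit https://example.com @john_doe #awesome"))
# hello world visit
```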

Step 8: Log File Parsing Example

import re

def parse_log_line(log_line):
    """
    Parse a server log line and extract information
    Log format: [LEVEL] YYYY-MM-DD HH:MM:SS,mmm Message
    """
    pattern = r'\[(\w+)\]\s+(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3})\s+(.+)'
    match = re.search(pattern, log_line)
    
    if match:
        return {
            'level': match.group(1),
            'timestamp': match.group(2),
            'message': match.group(3)
        }
    return None

# Sample log lines
logs = [
    "[INFO] 2024-09-21 10:30:45,123 Server started successfully",
    "[ERROR] 2024-09-21 10:31:02,456 Database connection failed",
    "[WARNING] 2024-09-21 10:32:18,789 High memory usage detected"
]

print("Parsed Log Entries:")
print("-" * 60)
for log in logs:
    parsed = parse_log_line(log)
    if parsed:
        print(f"Level: {parsed['level']}")
        print(f"Time:  {parsed['timestamp']}")
        print(f"Msg:   {parsed['message']}")
        print("-" * 60)
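
Numbered groups work, but named groups can make the same parser easier to read. The sketch below is one possible variant using (?P<name>...) and match.groupdict(); the names `LOG_PATTERN` and `parse_log_line_named` are hypothetical.

```python
import re

# Same log format as above, with each group given a name
LOG_PATTERN = re.compile(
    r'\[(?P<level>\w+)\]\s+'
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3})\s+'
    r'(?P<message>.+)'
)

def parse_log_line_named(log_line):
    match = LOG_PATTERN.search(log_line)
    # groupdict() returns {'level': ..., 'timestamp': ..., 'message': ...}
    return match.groupdict() if match else None

parsed = parse_log_line_named("[ERROR] 2024-09-21 10:31:02,456 Database connection failed")
print(parsed)
# {'level': 'ERROR', 'timestamp': '2024-09-21 10:31:02,456', 'message': 'Database connection failed'}
```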

Lab Assignment

Problem: Extract Information from Log File

You are given a log file line:

[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1

Tasks:

  1. Extract timestamp (YYYY-MM-DD HH:MM:SS,mmm)
  2. Extract username (between quotes)
  3. Extract IP address
  4. Create a clean dictionary with extracted information

import re

# Given log line
log_line = "[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1"

# Write your regex patterns here
timestamp_pattern = r''    # Fill this
username_pattern = r''     # Fill this
ip_pattern = r''           # Fill this

# Extract using re.search()
# Your code here

# Expected output:
# {
#     'timestamp': '2024-09-21 10:30:45,123',
#     'username': 'john_doe',
#     'ip': '192.168.1.1'
# }

Solution (for checking):

import re

log_line = "[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1"

timestamp_pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}'
username_pattern = r"'(\w+)'"
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

timestamp = re.search(timestamp_pattern, log_line).group()
username = re.search(username_pattern, log_line).group(1)
ip = re.search(ip_pattern, log_line).group()

result = {
    'timestamp': timestamp,
    'username': username,
    'ip': ip
}

print(result)
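
The solution above assumes every pattern matches; if one does not, re.search() returns None and .group() raises an AttributeError. The IP pattern also accepts impossible values such as 999.999.999.999. A more defensive sketch is shown below. It uses the standard-library ipaddress module for validation, and the function name `extract_login_failure` is hypothetical.

```python
import ipaddress
import re

def extract_login_failure(log_line):
    timestamp = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}', log_line)
    username = re.search(r"'(\w+)'", log_line)
    ip = re.search(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', log_line)

    # Any missing field means the line is not in the expected format
    if not (timestamp and username and ip):
        return None

    # The regex only checks the shape; ipaddress checks the value ranges
    try:
        ipaddress.ip_address(ip.group())
    except ValueError:
        return None

    return {
        'timestamp': timestamp.group(),
        'username': username.group(1),
        'ip': ip.group(),
    }

log_line = "[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1"
print(extract_login_failure(log_line))
# {'timestamp': '2024-09-21 10:30:45,123', 'username': 'john_doe', 'ip': '192.168.1.1'}
```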