PostBazzar - Professional Blog

Introduction (Understanding the Basics)

What is a Regular Expression (Regex)?

In simple words: Regex is a powerful pattern-matching language that helps you find, check, or modify specific patterns in text.

Real-life example:

You have 1000 emails. You want to extract all phone numbers from them.

Without regex: hours of manual work, checking message by message
With regex: a few seconds, using one pattern such as \d{3}-\d{3}-\d{4} (this one covers numbers written like 123-456-7890)
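
As a rough illustration of that comparison, here is a minimal sketch. The sample messages are invented for this example, and the pattern only covers numbers written in the 123-456-7890 layout.

```python
import re

# Hypothetical sample messages (invented for illustration)
messages = [
    "Hi, call me at 555-123-4567 after lunch.",
    "No number here, just saying hello.",
    "Office: 555-987-6543, mobile: 555-000-1111.",
]

# Three digits, hyphen, three digits, hyphen, four digits
phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

phone_numbers = []
for message in messages:
    # findall() returns every non-overlapping match in the message
    phone_numbers.extend(phone_pattern.findall(message))

print(phone_numbers)
```

The same loop works unchanged for 3 messages or 1000, which is where the time saving comes from.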

Why is Text Cleaning Necessary?

Raw text is messy. For example:

Raw Tweet:

"RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome"

Cleaned Text:

"i love this product"

Why clean?

A machine learning model will not treat "LOOOOOOVE" and "love" as the same word unless you clean the text first.

Real-World Use Cases

What Will You Learn?

Basic regex patterns (digits, words, spaces, quantifiers)
Using Python's re module (findall, sub, match, search)
Cleaning real tweets, logs, and emails
Building your own text cleaning pipeline

Practical Lab (Step-by-Step)

Step 1: Regex Building Blocks (Reference Table)
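
A minimal sketch of the basic building blocks (digits, word characters, whitespace, quantifiers) in action. The sample string is made up for this example.

```python
import re

sample = "Order 42 shipped to Unit 7B on 2024-09-21"

# \d matches one digit; + means "one or more of the previous item"
digit_runs = re.findall(r'\d+', sample)

# \w matches a letter, digit, or underscore
words = re.findall(r'\w+', sample)

# \s matches whitespace; here it is used as the separator for split()
parts = re.split(r'\s+', sample)

# {4} means "exactly four of the previous item"
four_digits = re.findall(r'\d{4}', sample)

# ? makes the previous item optional
spellings = re.findall(r'colou?r', "color or colour")

print(digit_runs)
print(words)
print(parts)
print(four_digits)
print(spellings)
```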

Step 2: How to Use Regex in Python (4 Main Functions)

import re

text = "My email is john@gmail.com and my phone is 123-456-7890"

# 1. findall() - Find all matches and return as list
emails = re.findall(r'\w+@\w+\.\w+', text)
print(emails)  # ['john@gmail.com']

# 2. search() - Find first match and return match object
match = re.search(r'\d{3}-\d{3}-\d{4}', text)
print(match.group())  # '123-456-7890'

# 3. sub() - Substitute/replace patterns
cleaned = re.sub(r'\d', 'X', text)  # Replace all digits with X
print(cleaned)  # "My email is john@gmail.com and my phone is XXX-XXX-XXXX"

# 4. match() - Check if string starts with pattern
result = re.match(r'My', text)
print(bool(result))  # True (because string starts with "My")
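
Two details are easy to miss with these functions: match() only looks at the start of the string, and both match() and search() return None when nothing is found, so calling .group() without a check can raise an AttributeError. The sketch below shows one way to handle that. The helper name first_phone is hypothetical.

```python
import re

text = "My email is john@gmail.com and my phone is 123-456-7890"

# match() anchors at position 0, so a word in the middle is not found
at_start = re.match(r'phone', text)
print(at_start)                    # None

# search() scans the whole string
anywhere = re.search(r'phone', text)
print(anywhere is not None)        # True

# fullmatch() succeeds only if the entire string fits the pattern
whole = re.fullmatch(r'\d{3}-\d{3}-\d{4}', "123-456-7890")
print(whole is not None)           # True


def first_phone(s):
    """Return the first phone-like substring, or None if there is none."""
    found = re.search(r'\d{3}-\d{3}-\d{4}', s)
    return found.group() if found else None


print(first_phone(text))
print(first_phone("no numbers in this sentence"))
```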

Step 3: Real Example - Cleaning a Twitter Tweet

Problem: Remove @mentions, hashtags, URLs, and extra punctuation from a tweet.

import re

tweet = "RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome"

print("Original:", tweet)
print()

# Step 1: Remove mentions (@username)
step1 = re.sub(r'@\w+', '', tweet)
print("Step 1 - Remove mentions:", step1)

# Step 2: Remove URLs
step2 = re.sub(r'https?://\S+', '', step1)
print("Step 2 - Remove URLs:", step2)

# Step 3: Remove hashtags
step3 = re.sub(r'#\w+', '', step2)
print("Step 3 - Remove hashtags:", step3)

# Step 4: Remove punctuation and emoji (keep only letters, digits, underscores, and whitespace)
step4 = re.sub(r'[^\w\s]', '', step3)
print("Step 4 - Remove punctuation:", step4)

# Step 5: Convert to lowercase
step5 = step4.lower()
print("Step 5 - Convert to lowercase:", step5)

# Step 6: Convert multiple spaces to single space
final = re.sub(r'\s+', ' ', step5).strip()
print("Step 6 - Final cleaned text:", final)

Output:

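Original: RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome

Step 1 - Remove mentions: RT : I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome
Step 2 - Remove URLs: RT : I LOOOOOOVE this product!!!!! 🚀🚀  #awesome
Step 3 - Remove hashtags: RT : I LOOOOOOVE this product!!!!! 🚀🚀  
Step 4 - Remove punctuation: RT  I LOOOOOOVE this product   
Step 5 - Convert to lowercase: rt  i loooooove this product   
Step 6 - Final cleaned text: rt i loooooove this product

Note that Step 4 also removes the emoji, because 🚀 is neither a word character (\w) nor whitespace (\s). This basic pipeline still leaves the retweet marker "rt" and the elongated word "loooooove". Reaching the "i love this product" target from the introduction takes two more substitutions: dropping the leading RT and collapsing runs of repeated letters.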
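
One way to get from this result to "i love this product" is to add two substitutions, sketched below as assumptions rather than as part of the original lab: drop a leading "RT", and collapse any letter repeated three or more times into a single letter. The function name clean_tweet is hypothetical. The repeated-letter rule is a heuristic: it would not fix "loove", and it shortens any genuine triple letter.

```python
import re

tweet = "RT @user123: I LOOOOOOVE this product!!!!! 🚀🚀 https://t.co/abc #awesome"


def clean_tweet(text):
    """Clean a tweet, including the retweet marker and elongated words."""
    text = re.sub(r'^RT\s+', '', text)             # leading retweet marker
    text = re.sub(r'@\w+', '', text)               # mentions
    text = re.sub(r'https?://\S+', '', text)       # URLs
    text = re.sub(r'#\w+', '', text)               # hashtags
    text = re.sub(r'[^\w\s]', '', text)            # punctuation and emoji
    text = text.lower()
    # ([a-z])\1{2,} = a letter followed by 2+ copies of itself (3+ in total)
    text = re.sub(r'([a-z])\1{2,}', r'\1', text)
    return re.sub(r'\s+', ' ', text).strip()       # tidy whitespace


print(clean_tweet(tweet))  # i love this product
```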

Step 4: Email Extraction (Practical Lab Task)

import re

# Complex example with multiple emails
text = """
Contact us:
- Support: support@company.com
- Sales: sales@company.co.uk
- John: john.doe123@my-email.net
- Invalid: admin@.com
- Another: user.name+tag@gmail.com
- Mary: mary_123@sub.domain.org
"""

# Pattern explanation:
# [\w\.\+-]+  : local part (letters, digits, underscores, dots, plus signs, hyphens)
# @            : literal @ symbol
# [\w\.-]+     : domain name
# \.           : dot
# \w+          : TLD (com, org, uk, etc.)
pattern = r'[\w\.\+-]+@[\w\.-]+\.\w+'

emails = re.findall(pattern, text)
print("Extracted emails:")
for i, email in enumerate(emails, 1):
    print(f"  {i}. {email}")
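
If the username and the domain are needed separately, named groups are one option. The sketch below is a variation on the pattern above, not a full email validator. The group names user and domain and the short sample string are choices made for this example.

```python
import re

text = "Support: support@company.com, Sales: sales@company.co.uk"

# (?P<name>...) captures part of the match under a name
email_pattern = re.compile(r'(?P<user>[\w.+-]+)@(?P<domain>[\w.-]+\.\w+)')

# finditer() yields match objects, so each group can be read by name
pairs = [(m.group('user'), m.group('domain'))
         for m in email_pattern.finditer(text)]

for user, domain in pairs:
    print(f"user={user}  domain={domain}")
```

Inside a character class such as [\w.+-], the dot and plus sign do not need a backslash, which is why this version of the pattern is slightly shorter.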

Step 5: Phone Number Validation (Real Form Example)

import re

def validate_phone(phone):
    """
    Validate phone numbers in various formats:
    - 123-456-7890
    - (123) 456-7890
    - 123.456.7890
    - 123 456 7890
    - 1234567890
    """
    pattern = r'^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$'
    
    if re.match(pattern, phone):
        return True, "Valid"
    else:
        return False, "Invalid"

# Test the function
test_numbers = [
    "123-456-7890",      # Valid
    "(123) 456-7890",    # Valid
    "123.456.7890",      # Valid
    "123 456 7890",      # Valid
    "1234567890",        # Valid (without separator)
    "123-45-6789",       # Invalid
    "12345",             # Invalid
    "abc-def-ghij",      # Invalid
    "1-800-FLOWERS",     # Invalid (letters not allowed)
    "+1-123-456-7890"    # Invalid (country code)
]

print("Phone Number Validation Results:")
print("-" * 45)
for num in test_numbers:
    status, msg = validate_phone(num)
    symbol = "✓" if status else "✗"
    print(f"{symbol} {num:20} → {msg}")
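
Because \(? and \)? are optional independently, the pattern above also accepts unbalanced input such as "(123-456-7890". In addition, $ used with re.match() tolerates a trailing newline. A stricter variant, sketched here under the assumption that parentheses must come in pairs, uses an alternation and re.fullmatch(). The name validate_phone_strict is hypothetical.

```python
import re

# Either "(123)" with both parentheses, or plain "123"; then the rest
STRICT_PHONE = re.compile(
    r'(?:\(\d{3}\)[-.\s]?|\d{3}[-.\s]?)'   # area code, with or without parens
    r'\d{3}[-.\s]?'                        # next three digits
    r'\d{4}'                               # last four digits
)


def validate_phone_strict(phone):
    """Return True only if the whole string is a phone number."""
    return STRICT_PHONE.fullmatch(phone) is not None


for candidate in ["(123) 456-7890", "123-456-7890", "1234567890",
                  "(123-456-7890", "123)456-7890", "123-456-7890\n"]:
    print(repr(candidate), validate_phone_strict(candidate))
```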

Step 6: Date Extraction from Text

import re

def extract_dates(text):
    """
    Extract dates in various formats:
    - YYYY-MM-DD
    - MM/DD/YYYY
    - DD-MM-YYYY
    """
    patterns = {
        'YYYY-MM-DD': r'\d{4}-\d{2}-\d{2}',
        'MM/DD/YYYY': r'\d{2}/\d{2}/\d{4}',
        'DD-MM-YYYY': r'\d{2}-\d{2}-\d{4}'
    }
    
    all_dates = []
    for format_name, pattern in patterns.items():
        matches = re.findall(pattern, text)
        for match in matches:
            all_dates.append((match, format_name))
    
    return all_dates

# Test
sample_text = """
Event dates:
- Conference: 2024-12-15
- Workshop: 12/20/2024
- Deadline: 31-01-2025
- Meeting: 2025-03-10
"""

dates = extract_dates(sample_text)
print("Extracted Dates:")
for date, format_type in dates:
    print(f"  {date} (format: {format_type})")
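
These patterns check only the shape of a date, so a string like 2024-13-45 would also be extracted. One possible follow-up, sketched here as an assumption, is to let the regex find candidates and let datetime.strptime() decide whether each one is a real calendar date. The name extract_valid_dates and the sample sentence are made up for this example.

```python
import re
from datetime import datetime

# Regex pattern -> matching strptime format (\b avoids matching inside longer numbers)
DATE_FORMATS = {
    r'\b\d{4}-\d{2}-\d{2}\b': '%Y-%m-%d',
    r'\b\d{2}/\d{2}/\d{4}\b': '%m/%d/%Y',
    r'\b\d{2}-\d{2}-\d{4}\b': '%d-%m-%Y',
}


def extract_valid_dates(text):
    """Return real calendar dates found in text, sorted chronologically."""
    found = []
    for pattern, fmt in DATE_FORMATS.items():
        for candidate in re.findall(pattern, text):
            try:
                found.append(datetime.strptime(candidate, fmt).date())
            except ValueError:
                # Right shape, but not a real date (month 13, day 45, ...)
                pass
    return sorted(found)


sample = "Conference: 2024-12-15, Workshop: 12/20/2024, Deadline: 31-01-2025, Typo: 2024-13-45"
valid_dates = extract_valid_dates(sample)
print([d.isoformat() for d in valid_dates])
```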

Step 7: Complete Text Cleaning Pipeline

import re

class TextCleaner:
    """
    A complete text cleaning pipeline
    """
    
    @staticmethod
    def remove_html_tags(text):
        """Remove HTML/XML tags"""
        return re.sub(r'<[^>]+>', '', text)
    
    @staticmethod
    def remove_urls(text):
        """Remove URLs"""
        return re.sub(r'https?://\S+|www\.\S+', '', text)
    
    @staticmethod
    def remove_mentions_hashtags(text):
        """Remove @mentions and #hashtags"""
        text = re.sub(r'@\w+', '', text)
        text = re.sub(r'#\w+', '', text)
        return text
    
    @staticmethod
    def remove_punctuation(text):
        """Remove punctuation (keep letters, numbers, spaces)"""
        return re.sub(r'[^\w\s]', '', text)
    
    @staticmethod
    def to_lowercase(text):
        """Convert to lowercase"""
        return text.lower()
    
    @staticmethod
    def remove_extra_spaces(text):
        """Remove extra whitespace"""
        return re.sub(r'\s+', ' ', text).strip()
    
    @classmethod
    def clean(cls, text, verbose=False):
        """
        Run complete cleaning pipeline
        """
        if verbose:
            print("Original:", text)
        
        text = cls.remove_html_tags(text)
        text = cls.remove_urls(text)
        text = cls.remove_mentions_hashtags(text)
        text = cls.remove_punctuation(text)
        text = cls.to_lowercase(text)
        text = cls.remove_extra_spaces(text)
        
        if verbose:
            print("Cleaned:", text)
        
        return text

# Test the pipeline
sample_text = """
<html>Hello World!</html> 
Check out https://example.com and @john_doe said #awesome!!!
This is a MESSY text!!! with   multiple    spaces.
"""

cleaned = TextCleaner.clean(sample_text, verbose=True)
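
The order of the steps in clean() matters: URLs are removed before punctuation because the URL pattern relies on "://" and "." to recognise a link. The sketch below uses small standalone helpers with the same regexes as the class above to compare both orders. The helper names and the sample sentence are hypothetical.

```python
import re


def strip_urls(s):
    return re.sub(r'https?://\S+|www\.\S+', '', s)


def strip_punct(s):
    return re.sub(r'[^\w\s]', '', s)


def squeeze_spaces(s):
    return re.sub(r'\s+', ' ', s).strip()


text = "Docs at https://example.com/guide now!"

# URLs first, then punctuation: the link disappears completely
right_order = squeeze_spaces(strip_punct(strip_urls(text)))

# Punctuation first: "://" and "." are gone, so the URL pattern no longer matches
wrong_order = squeeze_spaces(strip_urls(strip_punct(text)))

print(right_order)
print(wrong_order)
```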

Step 8: Log File Parsing Example

import re

def parse_log_line(log_line):
    """
    Parse a server log line and extract information
    Log format: [LEVEL] YYYY-MM-DD HH:MM:SS,SSS Message
    """
    pattern = r'\[(\w+)\]\s+(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3})\s+(.+)'
    match = re.search(pattern, log_line)
    
    if match:
        return {
            'level': match.group(1),
            'timestamp': match.group(2),
            'message': match.group(3)
        }
    return None

# Sample log lines
logs = [
    "[INFO] 2024-09-21 10:30:45,123 Server started successfully",
    "[ERROR] 2024-09-21 10:31:02,456 Database connection failed",
    "[WARNING] 2024-09-21 10:32:18,789 High memory usage detected"
]

print("Parsed Log Entries:")
print("-" * 60)
for log in logs:
    parsed = parse_log_line(log)
    if parsed:
        print(f"Level: {parsed['level']}")
        print(f"Time:  {parsed['timestamp']}")
        print(f"Msg:   {parsed['message']}")
        print("-" * 60)
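
Once lines can be parsed, simple summaries become easy. The sketch below is a variation on parse_log_line() that uses named groups and counts entries per level with collections.Counter. The malformed fourth line is invented to show that unparseable lines are skipped.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'\[(?P<level>\w+)\]\s+'
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3})\s+'
    r'(?P<message>.+)'
)

logs = [
    "[INFO] 2024-09-21 10:30:45,123 Server started successfully",
    "[ERROR] 2024-09-21 10:31:02,456 Database connection failed",
    "[WARNING] 2024-09-21 10:32:18,789 High memory usage detected",
    "this line does not follow the log format",   # hypothetical bad line
]

level_counts = Counter()
skipped = 0
for line in logs:
    match = LOG_PATTERN.search(line)
    if match:
        # groupdict() returns {'level': ..., 'timestamp': ..., 'message': ...}
        entry = match.groupdict()
        level_counts[entry['level']] += 1
    else:
        skipped += 1

print(dict(level_counts))
print("Skipped lines:", skipped)
```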

Lab Assignment

Problem: Extract Information from Log File

You are given a log file line:

[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1

Tasks:

Extract timestamp (YYYY-MM-DD HH:MM:SS,mmm)
Extract username (between quotes)
Extract IP address
Create a clean dictionary with extracted information

import re

# Given log line
log_line = "[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1"

# Write your regex patterns here
timestamp_pattern = r''    # Fill this
username_pattern = r''     # Fill this
ip_pattern = r''           # Fill this

# Extract using re.search()
# Your code here

# Expected output:
# {
#     'timestamp': '2024-09-21 10:30:45,123',
#     'username': 'john_doe',
#     'ip': '192.168.1.1'
# }

Solution (for checking):

import re

log_line = "[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1"

timestamp_pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}'
username_pattern = r"'(\w+)'"
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

timestamp = re.search(timestamp_pattern, log_line).group()
username = re.search(username_pattern, log_line).group(1)
ip = re.search(ip_pattern, log_line).group()

result = {
    'timestamp': timestamp,
    'username': username,
    'ip': ip
}

print(result)
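
The IP pattern in this solution checks only the shape of the address, so it would also accept 999.168.1.1. As an optional extension (an assumption, not part of the assignment), the standard-library ipaddress module can confirm that a candidate is a real IPv4 address. The function name extract_ipv4 is hypothetical.

```python
import re
import ipaddress


def extract_ipv4(line):
    """Return the first valid IPv4 address in line, or None."""
    for candidate in re.findall(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', line):
        try:
            # IPv4Address raises a ValueError subclass for out-of-range octets
            return str(ipaddress.IPv4Address(candidate))
        except ValueError:
            continue
    return None


log_line = "[ERROR] 2024-09-21 10:30:45,123 User 'john_doe' failed to login from IP 192.168.1.1"
print(extract_ipv4(log_line))
print(extract_ipv4("failed to login from IP 999.168.1.1"))
```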

Regular Expressions and Text Cleaning

Dr. Muhammad Azam
