Convert Outlook (.msg) to Thunderbird (.eml) - free scripts

* editorial posted by c_prompt in microsoft

With the help of AI tools and Redemption, I've created two Microsoft Outlook conversion tools to facilitate my journey from Windows to Linux. Moving decades of email messages (with hidden corruption from Outlook's many crashes and unreliability over the years) is a critical step to reinforce Linux as my daily-driver laptop. In summary, Outlook itself doesn't produce clean, faithful RFC 5322 (.eml) files from a MAPI store without losing fidelity and I wasn't able to find any third-party tool or code to do so either.

These two scripts - one written in PowerShell and one in Python - address all issues I found when trying to convert Outlook messages (directly or via exported .msg files) to .eml files (which are then imported into an open-sourced client), including:

Outlook (classic) does not export to .eml files directly
Dragging messages from Outlook to Windows Explorer (which creates .msg files) has to be done in small batches to prevent freezing, can produce corrupted .msg files relative to Outlook's UI (e.g., hidden messages not visible via the UI, .msg files without sender and/or recipient display names and/or email addresses, incorrect dates/times), and can produce seemingly duplicate .msg files for the same message with slight differences (making it difficult to identify and delete duplicates)
Using Outlook's COM automation object to build message files does not maintain fidelity to the original emails
There appear to be many questionable (closed-source, third-party) software tools for purchase that claim to export Outlook messages to .eml files that I don't trust to look at confidential emails (or just don't work reliably)
All of the existing Windows and Linux open-sourced and free tools (including other email clients like Thunderbird, Evolution, and em Client) I tried had problems including freezing when trying to import/convert tens of thousands of emails, missing hidden emails, missing messages of a certain class (e.g., calendar invites and responses), trying to create files with malformed names (e.g., over the Windows total path length character limit, encoded paths and filenames), not maintaining fidelity to the original emails, and poor to non-existent debug logging
Different tools and processes to export or reconstruct emails produced discrepancies in counts, internet headers (e.g., subjects, dates, times and time zone offsets, display names, email addresses, missing data like DKIM signatures and message IDs, incorrect content types), body text (including faulty inline CID images), and formatting which made it difficult to compare and determine fidelity to the source
Injecting fixes into .eml files to correct corrupted data was inconsistent and unreliable
LLMs were generally terrible at producing reliable code to anticipate and correct export issues (not to mention often regressing/breaking things that worked earlier).

As I didn't find any clean, reliable way to export Outlook to .eml files, I did what any hacker would: I built my own.

Convert Outlook Using Redemption

This PowerShell code uses the evaluation version of Redemption, a free COM library that provides enhanced access to Outlook/MAPI data. It provides the following functionality:

Ability to export Outlook (classic) messages with fidelity, and with recursive folder creation that mirrors the structure of Outlook's data file (i.e., .pst or IMAP account), to RFC 5322 (.eml) files that can be imported into open-source clients like Thunderbird
Ability to limit the export scope to include:
- An entire (root) .pst
- An individual folder
- An individual message (found via subject and date)
- All messages of a certain class
- A limited number of messages (useful for testing)
Ability to substitute a single email address with another email address in cases where Outlook corruption claims an email was sent from a different address (seemed prevalent with meetings)
Ability to export the proper sender and/or recipient email addresses in cases of corruption (often yielding addresses like name@invalidemail.com, invalidemail@company.com, and nobody@invalid.invalid)
Ability to log progress as the script runs and provide detailed/verbose and summary logs when finished
Ability to name export paths and filenames compatible with Windows limitations
Ability to perform exports on network drives (useful when limited space is available on local drives for large .pst files)

EML Folder Comparison - Folder A vs Folder B

While this step isn't necessary if you haven't used a different tool/process to create .eml files (e.g., drag/drop from Outlook to Windows Explorer), this Python code compares .eml files to determine matches and duplicates, including the following functionality:

Ability to compare all .eml files within the same folder and identify duplicates (e.g., from dragging all Outlook Sent Items into Windows Explorer in small batches and then converting these to .eml files via MSGConvert or readpst produced many files that were almost identical)
Ability to compare all .eml files between two folders (e.g., Folder A containing dragged/dropped files converted to .eml vs. Folder B containing exported messages via PowerShell script) and find matches (or likely matches) using different criteria, including:
- Message ID (if available)
- Subject and date/time concatenated based on normalized time zones
- Body text (configurable based on a user-defined number of characters)
- Subject and date/time concatenated but allowing for a user-defined time range (e.g., if the "drift" is set at 10 minutes, two .eml files will be considered a likely match if the subjects match and the time difference of Folder B's .eml is no more than plus/minus 10 minutes the time of Folder A's .eml)
- Subject, date/time, and sender/recipients concatenated
Ability to move duplicate .eml files to a backup directory
Ability to limit the comparison scope to include a limited number of messages (useful for testing)
Ability to build a cache to reduce re-reun time for different drifts (i.e., useful for reducing the time required for analysis when evaluating/testing results over different time periods)
Ability to log progress as the script runs and provide detailed/verbose and summary logs when finished

Basic requirements of installed software on local Windows machine:

Outlook client
Redemption
Python
PowerShell

General process:

Install software requirements as necessary
Open Outlook (Outlook must be running when the PowerShell script runs; if it isn't, it will load Outlook in the background)
Using a text editor, save the two code blocks below
Open Convert Outlook Using Redemption.ps1 in PowerShell and edit the configuration variables
Press F5 to run the script
Once the run is complete, analyze the log files to identify potential problems (if you re-run the script after testing or fixing problems, either delete the existing output_base_folder or designate a new one)
Only if you're comparing results from two different tools/processes, edit the configuration variables in EML Folder Comparison - Folder A vs Folder B.py
From the command line, run the Python script (e.g., python "EML Folder Comparison - Folder A vs Folder B.py"); on the first run, I do not suggest moving duplicate files (i.e., wait until you've verified those are proper duplicates)
Once the run is complete, analyze the log files to identify potential problems (e.g., use Excel's Data Filter functionality to easily review columns such as Match Status, Match Method, and Confidence in the EML Comparison Report)
Adjust the drift variable if you identify matches that should have been made but were not due to a limited time window (e.g., I found .eml matches with up to a 16-hour time window difference, possibly from causes such as incorrect time on local machine, Outlook corruption, time the Send button was pressed vs. actual time the message left the Outbox, etc.)
If importing into Thunderbird, you can use the ImportExportTools NG add-on which will automatically rebuild the folder structure based on the .eml folder structure

Convert Outlook Using Redemption.ps1

# ==============================================================================
# Script: Convert Outlook Using Redemption.ps1
# Version: 1.1
# ==============================================================================
#
# DESCRIPTION:
#   This script exports Outlook emails to EML format using the Redemption library.
#   It recursively processes all mail folders, exports messages, repairs address
#   corruption, and generates detailed CSV reports with folder statistics.
#
# PREREQUISITES:
#   1. Microsoft Outlook must be installed and configured on this machine
#   2. Redemption library must be installed (see installation instructions below)
#   3. PowerShell must be run with appropriate permissions
#   4. Network paths must be accessible from this machine
#
# ==============================================================================
# REDEMPTION INSTALLATION INSTRUCTIONS
# ==============================================================================
#
# Redemption is a COM library that provides enhanced access to Outlook/MAPI data.
# It is required for this script to function properly.
#
# STEP 1: Download Redemption
# --------------------------------
# Visit the official Dmitry Streble website:
#   https://www.dimastr.com/redemption/
#
# Click on "Download Redemption" and download the latest version.
# As of this writing, the current version is 5.x
#
# STEP 2: Install Redemption
# --------------------------------
# Option A - MSI Installer (Recommended):
#   1. Run the downloaded .msi installer file
#   2. Follow the installation wizard
#   3. Accept the license agreement
#   4. Choose installation directory (default is fine)
#   5. Complete the installation
#
# Option B - Manual Registration (if MSI doesn't work):
#   1. Extract the downloaded archive
#   2. Open Command Prompt as Administrator
#   3. Navigate to the Redemption folder
#   4. Run: regsvr32 Redemption.dll
#   5. You should see "DllRegisterServer succeeded" message
#
# STEP 3: Verify Installation
# --------------------------------
# Open PowerShell and run:
#   $test = New-Object -ComObject "Redemption.RDOSession"
#   $test.GetType().FullName
#
# If this returns "Redemption.RDOSession" without error, installation succeeded.
# If you get an error, Redemption is not properly registered.
#
# STEP 4: Licensing (Important!)
# --------------------------------
# Redemption is commercial software with a free evaluation mode.
#
# Evaluation Mode:
#   - Free to use for testing
#   - Adds "Evaluation Redemption" headers to exported EML files
#   - May display occasional nag dialogs
#   - No expiration date, but not for production use
#
# Licensed Mode:
#   - Purchase a license from https://www.dimastr.com/redemption/pricing.htm
#   - Run the License Manager tool (installed with Redemption)
#   - Enter your license key
#   - Removes evaluation headers and restrictions
#
# For production/enterprise use, a license is required.
#
# STEP 5: Troubleshooting
# --------------------------------
# Common Error: "Retrieving the COM class factory for component with CLSID 
#               {...} failed due to the following error: 80040154"
# Solution:       Redemption is not registered. Re-run Step 2 Option B as Admin.
#
# Common Error: "Redemption.RDOSession" object not found
# Solution:       Ensure you installed the correct version (32-bit vs 64-bit)
#                 to match your Outlook installation.
#
# Common Error: Permission denied when accessing Outlook
# Solution:       Run PowerShell as Administrator. Ensure Outlook is not in
#                 a locked state (close and restart Outlook if needed).
#
# Architecture Matching:
#   - If Outlook is 32-bit, install 32-bit Redemption
#   - If Outlook is 64-bit, install 64-bit Redemption
#   - To check Outlook architecture: File > Office Account > About Outlook
#
# ==============================================================================

$v = "1.1"

# ==============================================================================
# [1] CONFIGURATION VARIABLES
# ==============================================================================

# The name of the Outlook data file/store to process (e.g., "Outlook", "archive@domain.com")
$outlook_datafile = "Outlook"

# The specific folder to start in. Leave as "" to process the entire mailbox (IPMRootFolder).
# If you want a specific subfolder, use the path (e.g., "Inbox\ProjectX").
$outlook_root_folder = ""     

# Filters to restrict which emails are exported. Leave blank ("") to export everything.
$search_subject      = ""
$search_date         = ""

# A safety limit to stop processing after X emails. Set to 0 for unlimited.
$limit_process_count = 0

# Sometimes Outlook meetings/items get corrupted and lose their SMTP addresses.
# If a specific corrupted address is found, it will be replaced with this one during Bulk Repair.
$corrupted_email_address = "admin@company.com"
$replacement_email_address = "another_name@company.com"

# The types of Outlook items we actually care about exporting.
# This ignores things like Sticky Notes (IPM.StickyNote) or internal hidden items.
# Message Class Reference:
#   IPM.Note        = Standard email messages
#   IPM.Report      = Delivery reports and nondelivery reports
#   IPM.Schedule    = Meeting requests and responses
#   IPM.Sharing     = Sharing invitations and notifications
#   IPM.Post        = Forum posts
#   IPM.Recall      = Message recall attempts
#   IPM$            = An IPM message class is technically the root type, and it is
#                     rare for modern emails to just be "IPM" without a sub-type
#                     (like IPM.Note); but it does happen
$target_classes = "IPM.Note|IPM.Report|IPM.Schedule|IPM.Sharing|IPM.Post|IPM.Recall|IPM$"

# Folders to completely skip. If a folder's name matches one of these, 
# it and all its subfolders will be ignored.
# 
# Why exclude these folders?
#   - Contacts/Tasks/Calendar = Not email messages, different item types
#   - Journal/Notes/Memos     = Personal productivity items, not emails
#   - Outbox                  = Contains unsent messages (may cause issues)
#   - RSS Feeds               = Syndication content, not actual emails
#   - Conversation Action Settings = System folder for conversation rules
$excludedFolderNames = @(
    "Contacts",
    "Tasks",
    "Memos",
    "Journal",
    "Notes",
    "Calendar",
    "Outbox",
    "RSS Feeds",
    "Conversation Action Settings"
)

# Base output paths for exports, EML files, and logs.
$output_base_folder = "\\192.168.50.111\Temp\Powerscript_export"
$eml_base = Join-Path $output_base_folder "eml"
$logs_folder = Join-Path $output_base_folder "logs"
$backup_base = Join-Path $logs_folder ".eml_backup"

# File paths for the generated CSV reports and text log.
$details_csv = Join-Path $logs_folder "outlook_export_details_from_version_$v.csv"
$summary_csv = Join-Path $logs_folder "outlook_export_summary_from_version_$v.csv"
$log_file    = Join-Path $logs_folder "debug_log_from_version_$v.txt"

# Windows API path limit constants. We use these to gracefully handle 
# extremely long folder paths which would normally crash standard cmdlets.
# MAX_PATH: We start adding special prefixes when paths approach this length
# UNC_PREFIX: The "\\?\" prefix that enables long path support in Windows API
$MAX_PATH = 240  
$UNC_PREFIX = "\\?\\"  

# ==============================================================================
# [2] INITIALIZATION & GLOBAL STATE
# ==============================================================================

# Load .NET Assembly required for URL Decoding (used in Sanitize-PathSegment)
Add-Type -AssemblyName System.Web
# Ensure our output directories exist
# This prevents errors later when we try to write files to non-existent folders
if (-not (Test-Path $logs_folder)) { New-Item $logs_folder -ItemType Directory -Force | Out-Null }
if (-not (Test-Path $backup_base)) { New-Item $backup_base -ItemType Directory -Force | Out-Null }

# Global counters and collections used to track progress across recursive functions
# These are declared at script scope so all functions can access and update them
$script:fileCounter = 0
$script:totalProcessed = 0
$script:limitReached = $false

# HashSet to prevent exporting the exact same email twice if it is linked in multiple places
# This is important because the same email can appear in multiple folders (e.g., Inbox and Archive)
$script:processedEntryIDs = New-Object System.Collections.Generic.HashSet[string]

# Thread-safe bag to hold folder statistical data (for the summary CSV)
# ConcurrentBag allows safe concurrent writes from recursive function calls
$script:folderData = New-Object System.Collections.Concurrent.ConcurrentBag[Object]

# A list of successfully exported EML files that require the Bulk Repair post-processing pass
# We queue items here during export, then process them all at once after scanning completes
$script:exportQueue = New-Object System.Collections.Generic.List[Object]

# ==============================================================================
# [3] HELPER FUNCTIONS
# ==============================================================================

# Custom logger that writes to both the console and a text file with timestamps
# This provides real-time visibility during execution and permanent records for debugging
function Log($msg) {
    $line = "[$(Get-Date -Format 'yyyy-MM-dd HH:mm:ss')] $msg"
    Write-Host $line
    Add-Content $log_file $line -Encoding UTF8
}

# Scrubs subjects to make them safe for use as file names (removes slashes, quotes, etc.)
# Windows file names cannot contain: \ / : * ? " < > |
# We also limit length to prevent path length issues
function Clean-FileName($text) {
    if ([string]::IsNullOrWhiteSpace($text)) { return "NoSubject" }
    $text = $text -replace "`r", " " -replace "`n", " "
    $clean = $text -replace '[^a-zA-Z0-9\s\._-]', ''
    $clean = ($clean -replace '\s+', ' ').Trim()
    if ([string]::IsNullOrWhiteSpace($clean)) { return "NoSubject" }
    if ($clean.Length -gt 50) { return $clean.Substring(0,50).Trim() }
    return $clean
}

# Scrubs individual folder names to ensure they can be created in the Windows File System
# This is critical because Outlook folder names can contain characters that are invalid in Windows paths
function Sanitize-PathSegment([string]$segment) {
    if ([string]::IsNullOrWhiteSpace($segment)) { return "Empty" }
    
    # 1. Decode URL encoding (e.g., %2F becomes /)
    $seg = [System.Web.HttpUtility]::UrlDecode($segment)
    
    # 2. Replace illegal Windows characters AND Thunderbird-unfriendly characters
    # We replace:
    #   \ / : * ? " < > |  (Windows illegal)
    #   @                  (Thunderbird hierarchy issues)
    #   .                  (Thunderbird hierarchy/extension issues)
    #   %                  (Any remaining encoded junk)
    $seg = $seg -replace '[\\/:*?\"<>|@.%]', '_'
    
    # 3. Clean up whitespace
    $seg = $seg.Trim().TrimEnd('_')
    $seg = ($seg -replace '\s+', ' ').Trim()
    
    # 4. Cap folder name length
    if ($seg.Length -gt 80) { $seg = $seg.Substring(0,80) }
    if ([string]::IsNullOrWhiteSpace($seg)) { return "Empty" }
    
    return $seg
}

# Rebuilds the entire relative path using sanitized folder names
# Takes an Outlook folder path like "Inbox\Project\Subfolder" and makes it filesystem-safe
function Sanitize-RelPathForFS([string]$relPath) {
    if ([string]::IsNullOrWhiteSpace($relPath) -or $relPath -eq "ROOT") { return "ROOT" }
    $parts = $relPath.Split('\')
    $cleanParts = New-Object System.Collections.Generic.List[string]
    foreach ($p in $parts) {
        if ($null -eq $p) { $cleanParts.Add("Empty") | Out-Null }
        else { $cleanParts.Add((Sanitize-PathSegment $p)) | Out-Null }
    }
    return ($cleanParts -join '\')
}

# Prepends the special Windows "\\?\" prefix to paths exceeding ~260 characters
# This tricks the Windows API into allowing paths up to 32,767 characters long.
# Without this, any path over 260 characters will fail with "Path too long" errors.
function Add-LongPathPrefixIfNeeded([string]$path) {
    if ([string]::IsNullOrWhiteSpace($path)) { return $path }
    if ($path.Length -lt $MAX_PATH) { return $path }
    if ($path.StartsWith('\\?\')) { return $path }

    # Handle local drives (e.g., C:\...)
    if ($path -match '^[A-Za-z]:\\') {
        return "\\?\" + $path
    }
    # Handle Network UNC paths (e.g., \\server\share\...)
    # UNC paths require special format: \\?\UNC\server\share\path
    if ($path.StartsWith('\\')) {
        $noLeading = $path.Substring(2) 
        $parts = $noLeading.Split('\')
        if ($parts.Length -ge 2) {
            $server = $parts[0]
            $share  = $parts[1]
            $suffix = ""
            if ($parts.Length -gt 2) { $suffix = "\" + ($parts[2..($parts.Length-1)] -join '\') }
            return "\\?\UNC\$server\$share$suffix"
        }
    }
    return $path
}

# Removes the long-path prefix so normal string manipulation (like Split-Path) works
# Some PowerShell cmdlets don't handle the \\?\ prefix correctly, so we strip it when needed
function Remove-LongPathPrefix([string]$path) {
    if ([string]::IsNullOrWhiteSpace($path)) { return $path }
    if ($path.StartsWith('\\?\UNC\')) { return '\\' + $path.Substring(7) }
    if ($path.StartsWith('\\?\')) { return $path.Substring(4) }
    return $path
}

# Safely gets just the filename at the end of a long path
# Combines Remove-LongPathPrefix with Split-Path for safe filename extraction
function Get-LeafNameSafe([string]$path) {
    $p = Remove-LongPathPrefix $path
    return Split-Path -Path $p -Leaf
}

# Extracts the actual SMTP email address from complex Exchange/X500 objects
# Outlook often stores addresses as Exchange X500 format instead of SMTP
# This function attempts to resolve the actual email address through multiple methods
function Resolve-Address-Fidelity($recip) {
    if ($null -eq $recip) { return "" }
    $addr = ""
    try { $addr = $recip.Address } catch { }
    if ($addr -like "*@*") { return $addr }
    
    # Try grabbing specific MAPI properties if the standard address is X500
    # 0x39FE001F = PR_SMTP_ADDRESS (the actual SMTP address property)
    $smtp = try { $recip.Fields(0x39FE001F) } catch { $null }
    if ($null -ne $smtp -and $smtp -like "*@*") { return $smtp }
    
    # Try proxy addresses (0x800F101F = PR_PROXY_ADDRESSES)
    # This contains all known addresses for the recipient, including SMTP
    $proxies = try { $recip.Fields(0x800F101F) } catch { $null }
    if ($null -ne $proxies) {
        foreach ($p in $proxies) { if ($p -match "^SMTP:(.*)") { return $matches[1] } }
    }
    return $addr
}

# ==============================================================================
# [4] BULK REPAIR FUNCTION
# ==============================================================================
# Sometimes Outlook items (especially Meeting Responses) export as EMLs with 
# non-SMTP addresses (like Exchange X500 addresses) or missing addresses.
# This function re-opens the exported text files and explicitly patches the headers.
# 
# Why is this needed?
#   - Redemption exports preserve the original address format from Outlook
#   - Many email clients and archival systems require proper SMTP addresses
#   - This post-processing step ensures compatibility with other systems
function Invoke-Bulk-Repair {
    $totalQueue = $script:exportQueue.Count
    Log "--- Starting Bulk Post-Processing Repair Phase ($totalQueue files) ---"

    $currentIndex = 0

    foreach ($job in $script:exportQueue) {
        $currentIndex++
        
        # Update UI Progress Bar (throttled to update every 10 items so it doesn't slow down)
        # Progress bars prevent the console from appearing frozen during long operations
        if ($currentIndex % 10 -eq 0 -or $currentIndex -eq $totalQueue) {
            Write-Progress -Activity "Bulk Post-Processing Repair" `
                           -Status "Checking/Repairing EML headers..." `
                           -CurrentOperation "File $currentIndex of $totalQueue" `
                           -PercentComplete (($currentIndex / $totalQueue) * 100)
        }

        $emlPath = $job.Path
        $item = $job.Item
        $relPath = $job.RelPath

        if (-not (Test-Path -LiteralPath $emlPath)) {
            Log "      [!] Skip Repair: File not found. Checked Path: >>$emlPath<<"
            continue
        }

        try {
            $repairs = @()
            
            # 1. Check Sender Address
            $sName = ""; try { $sName = $item.SenderName } catch { }
            $sAddr = ""; try { $sAddr = $item.SenderEmailAddress } catch { }
            
            if ($sAddr -notlike "*@*") {
                Log "      [WARN] Address Found: (Non-SMTP) Sender: $sName ($sAddr) | Path: $emlPath"
                $realS = Resolve-Address-Fidelity $item.Sender
                if ($realS -like "*@*") {
                    if (![string]::IsNullOrEmpty($corrupted_email_address) -and $realS -eq $corrupted_email_address) {
                        Log "      [WARN] Resolved address matched CORRUPTED variable ($realS). Substituting: $replacement_email_address"
                        $realS = $replacement_email_address
                    }
                    $repairs += @{ "TargetName" = $sName; "OldAddr" = $sAddr; "NewAddr" = $realS; "Type" = "Sender" }
                }
            }

            # 2. Check Recipient Addresses
            foreach ($recip in $item.Recipients) {
                $rName = $recip.Name; $rAddr = ""; try { $rAddr = $recip.Address } catch { }
                if ($rAddr -notlike "*@*") {
                    if ($recip.DisplayType -eq 1 -or $recip.DisplayType -eq 5) {
                        Log "      [WARN] Address Found: EXPANDED GROUP: $rName | Path: $emlPath"
                    } else {
                        Log "      [WARN] Address Found: (Non-SMTP) Recipient: $rName ($rAddr) | Path: $emlPath"
                        $realR = Resolve-Address-Fidelity $recip
                        if ($realR -like "*@*" -and $realR -ne $rAddr) {
                            if (![string]::IsNullOrEmpty($corrupted_email_address) -and $realR -eq $corrupted_email_address) {
                                Log "      [WARN] Resolved address matched CORRUPTED variable ($realR). Substituting: $replacement_email_address"
                                $realR = $replacement_email_address
                            }
                            $repairs += @{ "TargetName" = $rName; "OldAddr" = $rAddr; "NewAddr" = $realR; "Type" = "Recipient" }
                        }
                    }
                }
                [Runtime.InteropServices.Marshal]::ReleaseComObject($recip) | Out-Null
            }

            # 3. Read and patch the text of the EML file
            $lines = Get-Content -LiteralPath $emlPath -ErrorAction Stop
            $hasAddressRepairs = $false; $hasHeaderCleanup = $false; $headerFinished = $false
            $newContent = New-Object System.Collections.Generic.List[string]

            foreach ($line in $lines) {
                # Stop parsing once we hit the blank line separating headers from body
                # EML format: headers first, blank line, then message body
                if (-not $headerFinished -and [string]::IsNullOrWhiteSpace($line)) { $headerFinished = $true }

                if (-not $headerFinished) {
                    # Remove Redemption evaluation tags
                    # These appear in unlicensed versions and may interfere with email clients
                    if ($line -match "^X-mailer: Evaluation Redemption MIME converter") { $hasHeaderCleanup = $true; continue }
                    
                    $updatedLine = $line
                    
                    # If we need to fix an address in the headers
                    if ($repairs.Count -gt 0 -and $line -match "^(From|To|Cc|Reply-To):") {
                        foreach ($rep in $repairs) {
                            $m = $false
                            if (-not [string]::IsNullOrEmpty($rep.OldAddr) -and $updatedLine.Contains($rep.OldAddr)) {
                                $updatedLine = $updatedLine -replace [regex]::Escape($rep.OldAddr), $rep.NewAddr
                                $m = $true
                            } elseif ($updatedLine.Contains($rep.TargetName) -and $updatedLine -notlike "*$($rep.NewAddr)*") {
                                $updatedLine = $updatedLine -replace [regex]::Escape($rep.TargetName), "$($rep.TargetName) <$($rep.NewAddr)>"
                                $m = $true
                            }
                            if ($m) { $hasAddressRepairs = $true; Log "      [WARN] Repaired $($rep.Type): $($rep.TargetName) -> $($rep.NewAddr) | Path: $emlPath" }
                        }
                    }
                    $newContent.Add($updatedLine)
                } else { 
                    $newContent.Add($line) 
                }
            }

            # 4. Save the patched file back out (and make a backup of the original just in case)
            if ($hasAddressRepairs -or $hasHeaderCleanup) {
                if ($hasAddressRepairs) {
                    $targetBackupDir = Join-Path $backup_base $relPath
                    if (-not (Test-Path $targetBackupDir)) { New-Item $targetBackupDir -ItemType Directory -Force | Out-Null }
                    $backupPath = Join-Path $targetBackupDir (Get-LeafNameSafe $emlPath)
                    if (-not (Test-Path -LiteralPath $backupPath)) { Copy-Item -LiteralPath $emlPath -Destination $backupPath -Force; Log "      [BACKUP] Original saved to: $backupPath" }
                }
                $newContent | Set-Content -LiteralPath $emlPath -Encoding UTF8
            }
        } catch {
            Log "      [!] Bulk Repair Error: $emlPath | $($_.Exception.Message)"
        }
        
        # Release the COM item from memory now that we are completely done with it
        # This is critical to prevent memory leaks and Outlook hanging
        [Runtime.InteropServices.Marshal]::ReleaseComObject($item) | Out-Null
    }

    # Hide the progress bar
    Write-Progress -Activity "Bulk Post-Processing Repair" -Completed
}

# ==============================================================================
# [5] MESSAGE EXPORT FUNCTION
# ==============================================================================
function Process-Message-Redemption($item, $relPath) {
    if ($script:limitReached) { return $false }

    try {
        # Unique ID to avoid exporting duplicates (e.g., items linked in multiple folders)
        # EntryID is Outlook's unique identifier for each item
        $eid = try { $item.EntryID } catch { [guid]::NewGuid().ToString() }
        if ($script:processedEntryIDs.Contains($eid)) { return $false }

        # Extract dates and subject
        $subj = $item.Subject
        $dtObj = $item.SentOn

        $dtText = "N/A"
        if ($null -ne $dtObj) { $dtText = $dtObj.ToString("ddd dd/MM/yyyy HH:mm") }

        $dtISOFull = "0000-00-00 00:00:00"
        if ($null -ne $dtObj) { $dtISOFull = $dtObj.ToString("yyyy-MM-dd HH:mm:ss") }

        $dtISODate = "0000-00-00"
        if ($null -ne $dtObj) { $dtISODate = $dtObj.ToString("yyyy-MM-dd") }

        # Apply user filters (Date/Subject)
        # These allow targeted exports without processing the entire mailbox
        if (![string]::IsNullOrWhiteSpace($search_subject) -and $subj -ine $search_subject.Trim()) { return $false }
        if (![string]::IsNullOrWhiteSpace($search_date) -and $dtISODate -ne $search_date) { return $false }

        # Mark as processed
        $script:processedEntryIDs.Add($eid) | Out-Null
        $script:fileCounter++

        # Create a filesystem-safe directory path based on the Outlook folder hierarchy
        $fsRelPath = Sanitize-RelPathForFS $relPath
        $shortFsRelPath = $fsRelPath
        
        # Aggressively shorten path if it's too long
        # This is a fallback when folder names themselves create paths that are too long
        if ($shortFsRelPath.Length -gt 100) {
            $parts = $shortFsRelPath -split '\\'
            if ($parts.Count -ge 2) {
                $shortFsRelPath = $parts[0] + "\" + $parts[-1]
                if ($shortFsRelPath.Length -gt 100) {
                    $shortFsRelPath = $parts[0].Substring(0, [Math]::Min(50, $parts[0].Length)) + "\" +
                                       $parts[-1].Substring(0, [Math]::Min(45, $parts[-1].Length))
                }
            } else {
                $shortFsRelPath = $parts[0].Substring(0, [Math]::Min(100, $parts[0].Length))
            }
        }

        $targetEmlDirPlain = Join-Path $eml_base $shortFsRelPath
        $targetEmlDir = Add-LongPathPrefixIfNeeded $targetEmlDirPlain

        # Attempt to create the output folder. Fall back to a "LongPaths" folder if creation fails.
        # Multiple try/catch layers ensure we always have somewhere to write the file
        try {
            $needsCreate = $false
            try {
                if (-not (Test-Path -LiteralPath $targetEmlDir)) { $needsCreate = $true }
            } catch {
                $needsCreate = $true
            }

            if ($needsCreate) {
                try {
                    New-Item -ItemType Directory -Force -Path $targetEmlDir | Out-Null
                } catch {
                    $fallbackPlain = Join-Path $eml_base "LongPaths"
                    $fallbackDir = Add-LongPathPrefixIfNeeded $fallbackPlain
                    if (-not (Test-Path -LiteralPath $fallbackDir)) {
                        New-Item -ItemType Directory -Force -Path $fallbackDir | Out-Null
                    }
                    $targetEmlDirPlain = $fallbackPlain
                    $targetEmlDir = $fallbackDir
                    $shortFsRelPath = "LongPaths"
                }
            }
        } catch {
            $fallbackPlain = Join-Path $eml_base "LongPaths"
            $fallbackDir = Add-LongPathPrefixIfNeeded $fallbackPlain
            if (-not (Test-Path -LiteralPath $fallbackDir)) {
                New-Item -ItemType Directory -Force -Path $fallbackDir | Out-Null
            }
            $targetEmlDirPlain = $fallbackPlain
            $targetEmlDir = $fallbackDir
            $shortFsRelPath = "LongPaths"
        }

        # Build the final EML file name
        # Format: YYYY-MM-DD_Subject_000001.eml
        $cleanSubject = Clean-FileName $subj
        $fName = "$dtISODate`_$cleanSubject`_$($script:fileCounter.ToString("D6")).eml"

        if ($fName.Length -gt 100) {
            $fName = $fName.Substring(0,100) + ".eml"
        }

        $emlPathPlain = Join-Path $targetEmlDirPlain $fName
        $emlPath = Add-LongPathPrefixIfNeeded $emlPathPlain

        Log "[$($script:fileCounter.ToString("D6"))] Exporting ($($item.MessageClass)): $subj ($emlPath)"
        
        # SaveAs 1024 is the Redemption format for EML.
        # Format code 1024 = olMIME (MIME format, which is EML)
        $item.SaveAs($emlPath, 1024)

        # Queue it up for Post-Processing bulk repair
        # We keep the COM object reference alive until repair is complete
        $script:exportQueue.Add(@{
            "Path" = $emlPath
            "Item" = $item
            "RelPath" = $shortFsRelPath
        }) | Out-Null

        # Write to the Details CSV
        # Includes a clickable hyperlink to open the EML file directly from Excel
        $csvLine = """$dtText"",""$dtISOFull"",""$($subj -replace '"','''')"",""=HYPERLINK(""""$emlPath"""",""""Open"""")"""
        Add-Content $details_csv $csvLine -Encoding UTF8

        if ($limit_process_count -gt 0 -and $script:fileCounter -ge $limit_process_count) {
            Log "!!! LIMIT REACHED ($limit_process_count items). Stopping export scan. !!!"
            $script:limitReached = $true
        }

        return $true
    } catch {
        Log "      [!] ERROR on item $script:fileCounter: $($_.Exception.Message)"
        return $false
    }
}

# ==============================================================================
# [6] RECURSIVE FOLDER FUNCTION
# ==============================================================================

# Determines if a folder should be completely skipped based on its name or path
function Is-FolderExcluded([object]$folder) {
    try {
        $name = $folder.Name
        if ($null -ne $name -and $excludedFolderNames -contains $name) { return $true }
    } catch { }

    # Also check the folder's full path just in case
    # This catches folders that might be nested under different parents
    try {
        $fp = $folder.FolderPath
        if ($null -ne $fp) {
            foreach ($ex in $excludedFolderNames) {
                # Uses regex to match the exact folder name bounded by slashes or end of string
                if ($fp -match "(?i)(^|\\\\)$([regex]::Escape($ex))($|\\\\)") { return $true }
            }
        }
    } catch { }

    return $false
}

function Process-Folder-Recursive($folder, $mailboxRootPath) {
    # Calculate the relative path from the root of the mailbox
    # This creates the folder structure for our export directories
    $rel = $folder.FolderPath.Replace($mailboxRootPath, "").TrimStart('\')
    if ([string]::IsNullOrEmpty($rel)) { $rel = "ROOT" }

    # Skip if it's on the exclusion list (like Contacts, Tasks, etc.)
    if (Is-FolderExcluded $folder) {
        return $null
    }

    $classKeys = $target_classes.Split('|')

    # Setup statistical counters for this specific folder
    # We track both local items and rolled-up totals from subfolders
    $stats = @{
        Path = $rel
        Total = [int64]0
        Classes = @{}
    }
    foreach ($c in $classKeys) { $stats["Classes"][$c] = [int64]0 }

    # ------------------------------------------------------------------------
    # STEP A: Local Folder Scan (Scan STANDARD items)
    # ------------------------------------------------------------------------
    $items = $null
    try {
        $items = $folder.Items
        $itemCount = $items.Count
        Log "--- Scanning Folder: $($folder.FolderPath) (Standard Items: $itemCount) ---"

        for ($i = 1; $i -le $itemCount; $i++) {
            if ($script:limitReached) { break }

            # Update UI Progress Bar (throttled)
            if ($i % 25 -eq 0 -or $i -eq $itemCount) {
                Write-Progress -Activity "Scanning Outlook Folders" `
                               -Status "Folder: $rel" `
                               -CurrentOperation "Processing item $i of $itemCount" `
                               -PercentComplete (($i / $itemCount) * 100)
            }

            $item = $null
            try {
                $item = $items.Item($i)

                # Only process items matching our target classes (e.g., IPM.Note)
                if ($item.MessageClass -match $target_classes) {
                    if (Process-Message-Redemption $item $rel) {
                        # Increment counters on successful export
                        $stats["Total"] = [int64]$stats["Total"] + 1
                        foreach ($c in $classKeys) {
                            if ($item.MessageClass -match [regex]::Escape($c)) {
                                $stats["Classes"][$c] = [int64]$stats["Classes"][$c] + 1
                                break
                            }
                        }
                    }
                }
            } catch {
                # Ignore isolated failures on individual items, keep looping
            } 
            # Note: We do not ReleaseComObject($item) here. The Bulk Repair phase needs it active in memory.
        }
    } catch {
        Log "      [WARN] Failed counting items in folder '$($folder.FolderPath)': $($_.Exception.Message)"
    } finally {
        try { if ($items) { [Runtime.InteropServices.Marshal]::ReleaseComObject($items) | Out-Null } } catch {}
    }

    Write-Progress -Activity "Scanning Outlook Folders" -Completed

    # ------------------------------------------------------------------------
    # STEP A.2: Local Folder Scan (Scan HIDDEN/ASSOCIATED items)
    # ------------------------------------------------------------------------
    $hiddenItems = $null
    try {
        $hiddenItems = $folder.HiddenItems
        $hiddenCount = $hiddenItems.Count
        
        if ($hiddenCount -gt 0) {
            Log "--- Scanning Hidden Items: $($folder.FolderPath) (Count: $hiddenCount) ---"
            
            for ($i = 1; $i -le $hiddenCount; $i++) {
                if ($script:limitReached) { break }

                # Update UI Progress
                if ($i % 25 -eq 0 -or $i -eq $hiddenCount) {
                    Write-Progress -Activity "Scanning Hidden Items" `
                                   -Status "Folder: $rel" `
                                   -CurrentOperation "Processing hidden item $i of $hiddenCount" `
                                   -PercentComplete (($i / $hiddenCount) * 100)
                }

                $item = $null
                try {
                    $item = $hiddenItems.Item($i)
                    
                    # Process hidden items using the SAME logic as standard items
                    if ($item.MessageClass -match $target_classes) {
                        if (Process-Message-Redemption $item $rel) {
                            $stats["Total"] = [int64]$stats["Total"] + 1
                            # (Optional: might want to track these separately in stats)
                        }
                    }
                } catch { 
                    # Ignore errors on hidden items
                }
            }
        }
    } catch {
        Log "      [WARN] Failed scanning hidden items: $($_.Exception.Message)"
    } finally {
        try { if ($hiddenItems) { [Runtime.InteropServices.Marshal]::ReleaseComObject($hiddenItems) | Out-Null } } catch {}
    }

    # ------------------------------------------------------------------------
    # STEP B: Recurse into Subfolders
    # ------------------------------------------------------------------------
    
    # We load subfolders into a local List first (snapshot). 
    # If we iterate over the raw COM collection directly, Outlook can throw a 
    # "Collection was modified" error if an item synchronizes in the background.
    $subFolders = $null
    $subFolderList = New-Object System.Collections.Generic.List[object]
    try {
        $subFolders = $folder.Folders
        $subCount = $subFolders.Count
        for ($j = 1; $j -le $subCount; $j++) {
            try { $subFolderList.Add($subFolders.Item($j)) | Out-Null } catch { }
        }
    } catch {
        Log "      [WARN] Failed enumerating subfolders of '$($folder.FolderPath)': $($_.Exception.Message)"
    } finally {
        try { if ($subFolders) { [Runtime.InteropServices.Marshal]::ReleaseComObject($subFolders) | Out-Null } } catch {}
    }

    # Now recurse over our safe snapshot list
    $childResults = New-Object System.Collections.Generic.List[object]
    foreach ($sub in $subFolderList) {
        try {
            $child = Process-Folder-Recursive $sub $mailboxRootPath
            if ($null -ne $child) { $childResults.Add($child) | Out-Null }
        } catch {
            Log "      [ERROR] Failed to process subfolder $($sub.FolderPath): $($_.Exception.Message)"
        } finally {
            try { [Runtime.InteropServices.Marshal]::ReleaseComObject($sub) | Out-Null } catch {}
        }
    }

    # ------------------------------------------------------------------------
    # STEP C: Rollup Metrics
    # ------------------------------------------------------------------------
    # Add all child statistics into our parent totals so the CSV correctly shows 
    # "SubFoldersTotalUnderneath".
    # This allows you to see not just what's in each folder, but what's in all
    # descendant folders as well.
    
    $rollupTotal = [int64]$stats["Total"]
    $rollupClasses = @{}
    foreach ($c in $classKeys) { $rollupClasses[$c] = [int64]$stats["Classes"][$c] }

    foreach ($child in $childResults) {
        $rollupTotal += [int64]$child.RollupTotal
        foreach ($c in $classKeys) {
            $rollupClasses[$c] += [int64]$child.RollupClasses[$c]
        }
    }

    # Create the data object representing this folder's final state
    $folderObj = [PSCustomObject]@{
        Path = $rel
        LocalTotal = [int64]$stats["Total"]
        RollupTotal = [int64]$rollupTotal
        LocalClasses = $stats["Classes"]
        RollupClasses = $rollupClasses
    }

    # Add it to the concurrent bag (we add it even if it has 0 items so it shows in the report)
    # This ensures empty folders are still documented in the summary CSV
    $script:folderData.Add($folderObj)
    return $folderObj
}

# ==============================================================================
# [7] MAIN EXECUTION BLOCK
# ==============================================================================
try {
    # Initialize Headers for the Details CSV
    # This CSV contains one row per exported email with clickable links
    "Date_Text,Date_ISO,Subject,Hyperlink" | Out-File $details_csv -Encoding UTF8

    # Start Outlook Application and Redemption session via COM
    # This is where the script connects to Outlook through Redemption
    Write-Progress -Activity "Initializing" -Status "Connecting to Outlook..."
    $rSession = New-Object -ComObject "Redemption.RDOSession"
    $Outlook = New-Object -ComObject Outlook.Application
    $Namespace = $Outlook.GetNamespace("MAPI")

    # Connect Redemption to the existing Outlook session
    # This avoids creating a second Outlook instance
    $rSession.MAPIOBJECT = $Namespace.MAPIOBJECT
    $Mailbox = $rSession.Stores.Item($outlook_datafile)

    $mailboxRootPath = $Mailbox.IPMRootFolder.FolderPath
    $targetRoot = $Mailbox.IPMRootFolder

    # Navigate to a specific subfolder if the user specified one in the config
    # Allows targeted exports of specific folder trees
    if (![string]::IsNullOrWhiteSpace($outlook_root_folder)) {
        foreach ($p in $outlook_root_folder.Split("\")) {
            $targetRoot = $targetRoot.Folders.Item($p)
        }
    }
    Write-Progress -Activity "Initializing" -Completed

    # 1. Start the recursive extraction process
    # This walks the entire folder tree and exports matching items
    Process-Folder-Recursive $targetRoot $mailboxRootPath | Out-Null

    # 2. Run post-processing on the generated text files to fix addresses
    # Repairs any non-SMTP addresses in the exported EML files
    Invoke-Bulk-Repair

    # 3. Generate the Summary CSV containing rolled-up statistics
    # This provides a folder-by-folder breakdown of what was exported
    Write-Progress -Activity "Generating Summary CSV" -Status "Calculating final rollups..." -PercentComplete 50
    $headers = "FolderPath,TotalMessagesInFolder,SubFoldersTotalUnderneath"
    foreach ($c in $target_classes.Split('|')) { $headers += ",$c,${c}_TotalUnderneath" }
    $headers | Out-File $summary_csv -Encoding UTF8

    # Convert the Thread-safe bag to an array and sort it alphabetically by path
    $reportData = $script:folderData.ToArray() | Sort-Object Path
    foreach ($f in $reportData) {
        # Total in descendant subfolders is (RollupTotal - LocalTotal)
        $row = """$($f.Path)"",$($f.LocalTotal),$($f.RollupTotal - $f.LocalTotal)"
        foreach ($c in $target_classes.Split('|')) {
            $row += ",$($f.LocalClasses[$c]),$($f.RollupClasses[$c] - $f.LocalClasses[$c])"
        }
        $row | Add-Content $summary_csv
    }
    Write-Progress -Activity "Generating Summary CSV" -Completed

    Log "=== SUCCESS ==="
} catch {
    Log "FATAL ERROR: $($_.Exception.Message)"
    Log "Stack Trace: $($_.ScriptStackTrace)"
} finally {
    # ------------------------------------------------------------------------
    # CLEANUP
    # ------------------------------------------------------------------------
    # Explicitly release COM objects so Outlook doesn't hang in the background
    # This is critical - failing to release COM objects can leave Outlook processes running
    if ($rSession) { [Runtime.InteropServices.Marshal]::ReleaseComObject($rSession) | Out-Null }
    if ($Outlook) { [Runtime.InteropServices.Marshal]::ReleaseComObject($Outlook) | Out-Null }
    if ($Namespace) { [Runtime.InteropServices.Marshal]::ReleaseComObject($Namespace) | Out-Null }
    
    # Ensure any stray progress bars are cleared
    Write-Progress -Activity "Scanning Outlook Folders" -Completed -ErrorAction SilentlyContinue
    Write-Progress -Activity "Bulk Post-Processing Repair" -Completed -ErrorAction SilentlyContinue
}

EML Folder Comparison - Folder A vs Folder B.py

#!/usr/bin/env python3

# -- coding: utf-8 --
"""
Outlook EML Comparison (Version 1.1)
Purpose:
Compares two folders of .eml email files (Folder A and Folder B)
to find matching messages using multiple matching strategies:
- Exact Message-ID
- Date + Subject (with/without prefix stripping)
- Drift-tolerant Date + Subject (accounts for clock skew)
- From/To recipient signature strengthening
- Body snippet hashing (SHA1 of first N characters of plain text)

text

Outputs detailed CSV reports and logs.

Designed to compare Outlook exports from different sources as different
export methods sometimes yield different results (e.g., drag and drop
from original Outlook PST to Windows Explorer and then convert to .eml
vs. Redemption PowerShell export).
Usage:
Configure the constants at the top (FOLDER_A_PATH, FOLDER_B_PATH, etc.)
then run: python eml_comparison.py

Author notes:
- UNC paths (\\\\server\\\\share) are fully supported.
- Metadata caching speeds up repeated runs.
- Duplicate detection in Folder A can optionally move duplicates.
"""

import os
import re
import sys
import csv
import time
import pickle
import hashlib
import logging
import shutil
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta
from email import policy
from email.parser import BytesParser
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime, getaddresses

# ==============================================================================
# [1] CONFIGURATION SECTION
# ==============================================================================
# All tunable parameters are here. Change these before running the script.
VERSION = "1.1" # Script version ? appears in output filenames

# Network/UNC paths to folders containing .eml files (recursively scanned).
# Path to Folder A (The Source - Dragged/Dropped files)
FOLDER_A_PATH = r"\\192.168.50.111\Temp\DragDrop\Sent Items"

# Path to Folder B (The Target - Automation export)
FOLDER_B_PATH = r"\\192.168.50.111\Temp\Powerscript_export\eml\Sent Items"

# Output directory for CSVs, logs, cache, and moved duplicates.
OUTPUT_DIR = r"\\192.168.50.111\Temp\Powerscript_export\Logs"

# Drift tolerance in minutes ? accounts for clock skew between systems
# when comparing email timestamps. A message sent at 10:02 in Folder A
# will match a message sent at 10:05 in Folder B if DRIFT_WINDOW_MINUTES >= 3.
DRIFT_WINDOW_MINUTES = 960

# Body snippet matching: uses SHA1 hash of the first N characters of the
# email body (plain text preferred, HTML stripped). Helps match emails
# where dates differ but content is identical.
USE_BODY_SNIPPET_MATCHING = True
BODY_SNIPPET_MATCHING_CHARS = 400 # How many chars of body to hash
BODY_SNIPPET_READ_BYTES_LIMIT = 2_000_000 # Safety cap to prevent MemoryError

# From/To strengthening: when matching keys, include sender + recipient
# signature. Falls back to Cc/Bcc if To is empty (common for sent items).
USE_FROM_TO_IN_KEYS = True

# Date_Compare logic explained:
# - "none": use raw wall-clock time (offset ignored)
# - "subtract_abs_offset": wall-clock minus absolute timezone offset
# This normalizes timezone differences while preserving DST differences.
DATE_COMPARE_A_MODE = "subtract_abs_offset"
DATE_COMPARE_B_MODE = "none"

# Recipient list truncation for output (purely cosmetic, avoids huge CSV cells).
OUTPUT_RECIPIENT_LIST_MAX_CHARS = 250

# Duplicate handling in Folder A:
# If True, detect and optionally move duplicate .eml files in Folder A.
DETECT_DUPLICATES_IN_A = True
EXCLUDE_DUPLICATES_IN_A_FROM_MATCHING_DEFAULT = False

# When bucketing duplicates, include From + recipient signature in the key?
# Makes duplicate detection stricter (recommended: True).
DUPLICATE_BUCKET_INCLUDE_FROM_AND_RECIPIENT_SIG = True

# Memory/performance read caps for metadata parsing.
# Set to 0 to read entire file. Use >0 to limit bytes read for speed.
READ_BYTES_LIMIT = 0
HEADER_READ_BYTES_LIMIT = 0 # Reserved for future header-only parsing

# Test limiting ? set >0 to process only first N files (for debugging).
# 0 or negative means "process all".
MAX_FILES_TO_PROCESS_A = 0
MAX_FILES_TO_PROCESS_B = 0

# Metadata caching speeds up reruns when files haven't changed.
USE_METADATA_CACHE = True
CACHE_FORCE_REBUILD = False # Set True to ignore existing cache
CACHE_SCHEMA_VERSION = "1" # Bump when cache format changes

# Cache directory name (subfolder of OUTPUT_DIR).
CACHE_DIR_NAME = "cache"

# Persist body signatures for Folder A between runs (drift-independent).
USE_BODY_SIG_CACHE_PERSIST = True
USE_DUPLICATE_CACHE = True # Persist duplicate detection results

# Output filename suffixes include version and drift window for traceability.
DRIFT_SUFFIX = f"_from_version{VERSION}drift_at{DRIFT_WINDOW_MINUTES}_min"

LOG_FILENAME = f"EML_Comparison{DRIFT_SUFFIX}.log"
CSV_FILENAME = f"EML_Comparison_Report{DRIFT_SUFFIX}.csv"
DUPLICATES_A_CSV_FILENAME = f"EML_Comparison_Duplicates_In_A{DRIFT_SUFFIX}.csv"

INVENTORY_A_CSV_FILENAME = f"EML_Comparison_Inventory_A{DRIFT_SUFFIX}.csv"
INVENTORY_B_CSV_FILENAME = f"EML_Comparison_Inventory_B{DRIFT_SUFFIX}.csv"

UNMATCHED_B_CSV_FILENAME = f"EML_Comparison_Unmatched_B{DRIFT_SUFFIX}.csv"

# ==============================================================================
# [2] DATA STRUCTURES
# ==============================================================================
@dataclass
class MsgInfo:
    """
    Holds parsed metadata for a single .eml file.

    text

    Fields:
        folder_tag:       "A" or "B" (which folder this came from)
        path:             Full filesystem path to the .eml file
        message_id:       Normalized Message-ID header (without <> brackets)
        
        date_utc:         Parsed datetime in UTC (None if unparseable)
        date_wallclock:   Datetime without timezone (for drift matching)
        date_compare:     Normalized datetime used for matching keys
        
        *_minute:         String keys rounded to minute precision (for bucketing)
        
        subject_norm:     Full normalized subject (lowercase, whitespace cleaned)
        subject_norm_noprefix: Subject with "Re:"/"Fw:" prefixes stripped
        
        from_norm:        Normalized sender address(es)
        to_norm, cc_norm, bcc_norm: Normalized recipient addresses
        
        recipient_for_key_norm:  String used for recipient-based matching keys
        recipient_for_key_sig:   SHA1 hash of recipient_for_key_norm
        to_trunc_for_output:     Truncated To field for CSV output
        
        body_sig:         SHA1 hash of body snippet (empty if not computed)
        date_parse_note:  Status note from date parsing ("OK", "NoDateHeader", etc.)
    """
    folder_tag: str
    path: str

    message_id: str

    date_utc: datetime | None
    date_wallclock: datetime | None
    date_compare: datetime | None

    date_utc_minute: str
    date_wallclock_minute: str
    date_compare_minute: str

    subject_norm: str
    subject_norm_noprefix: str

    from_norm: str
    to_norm: str
    cc_norm: str
    bcc_norm: str

    recipient_for_key_norm: str
    recipient_for_key_sig: str
    to_trunc_for_output: str

    body_sig: str
    date_parse_note: str
# ==============================================================================
# [3] GLOBAL BODY SIGNATURE CACHE
# ==============================================================================
# IMPORTANT: This dict is mutated (updated/cleared) but never re-bound.
# Re-binding would cause Python to treat it as a local variable inside main().
BODY_SIG_CACHE: dict[str, str] = {}

# ==============================================================================
# [4] LOGGING SETUP
# ==============================================================================
def ensure_output_dir(path: str) -> None:
    """Create output directory if it doesn't exist."""
    os.makedirs(path, exist_ok=True)

def setup_logging(log_path: str) -> None:
    """
    Configure logging to file only.

    Console progress is handled separately by progress_update to keep
    a static display and prevent terminal scrolling.
    """
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s",
        handlers=[
            logging.FileHandler(log_path, encoding="utf-8"),
        ],
    )
def log_and_print(msg: str) -> None:
    """Convenience wrapper – logs at INFO level (file only; console uses static progress)."""
    logging.info(msg)

def progress_update(label: str, current: int, total: int, start_ts: float, path: str) -> None:
    """Update a single-line static progress display on stdout."""
    elapsed = int(time.time() - start_ts)
    pct = (current/total*100) if total else 100
    disp = path
    if len(disp) > 70:
        disp = "..." + disp[-67:]
    line = f"{label} {current}/{total} ({pct:5.1f}%) {elapsed}s | {disp}"
    sys.stdout.write("\
" + line.ljust(120))
    sys.stdout.flush()
    if current >= total:
        sys.stdout.write("\
")
        sys.stdout.flush()

# ==============================================================================
# [5] FILE ENUMERATION
# ==============================================================================
def iter_eml_files(root: str, max_files: int) -> list[str]:
    """
    Recursively find all .eml files under root.

    text

    Args:
        root:      Starting directory path
        max_files: Stop after finding this many (0 = unlimited)

    Returns:
        List of full paths to .eml files.
    """
    results: list[str] = []
    for dirpath, _, filenames in os.walk(root):
        for fn in filenames:
            if fn.lower().endswith(".eml"):
                results.append(os.path.join(dirpath, fn))
                if max_files and max_files > 0 and len(results) >= max_files:
                    return results
    return results
# ==============================================================================
# [6] NORMALIZATION UTILITIES
# ==============================================================================
# These functions clean and standardize email fields so that minor differences
# (whitespace, case, encoding) don't prevent matching.
RE_MULTISPACE = re.compile(r"\s+") # Matches one or more whitespace characters

def norm_whitespace(s: str) -> str:
    """
    Collapse all whitespace (including Unicode non-breaking spaces)
    into a single ASCII space, then trim.
    """
    s = (s or "").replace("\u00a0", " ") # Replace non-breaking space
    s = RE_MULTISPACE.sub(" ", s) # Collapse multiple spaces
    return s.strip()

def decode_mime_header(value: str) -> str:
    """
    Decode MIME encoded-word headers (e.g., =?utf-8?B?...?=).

    text

    Uses email.header.make_header which properly handles:
    - Base64 encoded words
    - Quoted-printable encoded words
    - Mixed encodings in a single header

    Falls back to raw value on failure.
    """
    if not value:
        return ""
    try:
        decoded = str(make_header(decode_header(value)))
        return decoded
    except Exception:
        return value
def normalize_subject(raw_subject: str) -> tuple[str, str]:
    """
    Normalize an email subject line.

    text

    Returns:
        (full_normalized, without_prefix)
        
    - full_normalized: lowercase, whitespace cleaned, dash normalization
    - without_prefix:  additionally strips "Re:" / "Fw:" / "Fwd:" prefixes
    """
    s = decode_mime_header(raw_subject or "")
    s = s.replace("\
", "").replace("\
", " ")
    s = norm_whitespace(s)
    # Normalize en-dash and em-dash to regular hyphen
    s = s.replace("\u2013", "-").replace("\u2014", "-")
    # NEW: normalize curly quotes and other typographic punctuation to ASCII
    s = s.replace("\u2018", "'").replace("\u2019", "'").replace("\u201A", "'").replace("\u201B", "'")
    s = s.replace("\u201C", '"').replace("\u201D", '"').replace("\u201E", '"').replace("\u201F", '"')
    s = s.replace("\u00AB", '"').replace("\u00BB", '"')
    s = s.replace("\u2039", "'").replace("\u203A", "'")
    s = s.replace("\u2026", "...")

    noprefix = s
    while True:
        # Strip leading "Re:", "Fw:", "Fwd:" (case-insensitive, with optional spaces)
        new = re.sub(r"^(?:\s*(re|fw|fwd)\s*:\s*)", "", noprefix, flags=re.IGNORECASE)
        if new == noprefix:
            break
        noprefix = norm_whitespace(new)

    return s.lower(), noprefix.lower()
def truncate_for_output(s: str, max_chars: int) -> str:
    """
    Truncate a string for CSV output display purposes.
    Appends "?" if truncated.
    """
    s = s or ""
    if max_chars <= 0 or len(s) <= max_chars:
        return s
    return s[:max_chars] + "?"

def safe_lower(s: str) -> str:
    """Trim and lowercase a string safely (handles None)."""
    return (s or "").strip().lower()

def normalize_addresses(header_value: str) -> str:
    """
    Convert an RFC 2822 address header (From, To, Cc, Bcc) into a
    canonical string for comparison.

    text

    Rules:
    - Decode MIME encoding first
    - Extract email addresses using email.utils.getaddresses
    - Prefer valid-looking emails (contains "@" and "." after "@")
    - Fall back to display name if no valid email
    - Lowercase everything
    - Join multiple addresses with ";"

    Example:
        Input:  "John Doe <john@example.com>, Jane <jane@test.org>"
        Output: "john@example.com;jane@test.org"
    """
    if not header_value:
        return ""
    header_value = decode_mime_header(header_value)
    addrs = getaddresses([header_value])  # Returns list of (display_name, email)
    canon: list[str] = []
    for display, email_addr in addrs:
        display = norm_whitespace(display or "")
        email_addr = safe_lower(email_addr or "")
        # Prefer email if it looks valid
        if email_addr and "@" in email_addr and "." in email_addr.split("@")[-1]:
            canon.append(email_addr)
        else:
            # Fall back to display name
            if display:
                canon.append(display.lower())
            else:
                canon.append(norm_whitespace(f"{display} {email_addr}").lower())
    return ";".join([c for c in canon if c])
def strip_html_to_text(html: str) -> str:
    """
    Crude HTML-to-text conversion for body snippet extraction.

    text

    Removes:
    - <script> and <style> tags and their contents
    - All other HTML tags
    - Common HTML entities

    Then normalizes whitespace.
    """
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", html)
    text = re.sub(r"(?is)<.*?>", " ", text)
    text = text.replace("&nbsp;", " ").replace("&amp;", "&")
    text = text.replace("&lt;", "<").replace("&gt;", ">")
    return norm_whitespace(text)
def round_dt_to_minute_key(dt: datetime | None) -> str:
    """
    Convert a datetime to a string key rounded to the nearest minute.

    text

    Used for bucketing messages by time (ignoring seconds/microseconds).

    Returns empty string if dt is None.
    """
    if not dt:
        return ""
    dt2 = dt.replace(second=0, microsecond=0)
    return dt2.strftime("%Y-%m-%d %H:%M")
# ==============================================================================
# [7] DATE PARSING AND COMPARISON LOGIC
# ==============================================================================
def parse_date_to_compare(
    raw_date: str, folder_tag: str
) -> tuple[datetime | None, datetime | None, datetime | None, str, str, str, str]:
    """
    Parse an email Date header into three datetime variants:

    text

    1. date_utc:        Always in UTC (best for absolute time comparison)
    2. date_wallclock:  Naive datetime (timezone stripped) ? preserves wall time
    3. date_compare:    Normalized for matching (applies DATE_COMPARE_*_MODE)

    Also returns minute-level string keys for bucketing and a status note.

    The DATE_COMPARE modes:
    - "none":              date_compare = date_wallclock (offset ignored)
    - "subtract_abs_offset": date_compare = date_wallclock - abs(tz_offset)
        This compensates for timezone differences while preserving DST.

    Args:
        raw_date:     Raw Date header string from email
        folder_tag:   "A" or "B" (determines which DATE_COMPARE_*_MODE to use)

    Returns:
        (date_utc, date_wallclock, date_compare,
         utc_minute_key, wallclock_minute_key, compare_minute_key,
         status_note)
    """
    if not raw_date:
        return None, None, None, "", "", "", "NoDateHeader"

    try:
        dt = parsedate_to_datetime(raw_date)
        if dt is None:
            return None, None, None, "", "", "", "DateParseReturnedNone"

        # Handle naive datetimes (no timezone info)
        if dt.tzinfo is None:
            local_dt = dt.astimezone()  # Assume system local timezone
            dt_utc = local_dt.astimezone(timezone.utc)
            dt_wallclock = dt.replace(tzinfo=None)
            dt_compare = dt_wallclock
            return (
                dt_utc,
                dt_wallclock,
                dt_compare,
                round_dt_to_minute_key(dt_utc),
                round_dt_to_minute_key(dt_wallclock),
                round_dt_to_minute_key(dt_compare),
                "NaiveAssumedLocal",
            )

        # Aware datetime ? convert to UTC and extract wallclock
        dt_utc = dt.astimezone(timezone.utc)
        dt_wallclock = dt.replace(tzinfo=None)

        # Apply DATE_COMPARE mode
        offset = dt.utcoffset() or timedelta(0)
        abs_off = abs(offset)

        mode = DATE_COMPARE_A_MODE if folder_tag == "A" else DATE_COMPARE_B_MODE
        if mode == "subtract_abs_offset":
            dt_compare = dt_wallclock - abs_off
        else:
            dt_compare = dt_wallclock

        return (
            dt_utc,
            dt_wallclock,
            dt_compare,
            round_dt_to_minute_key(dt_utc),
            round_dt_to_minute_key(dt_wallclock),
            round_dt_to_minute_key(dt_compare),
            "OK",
        )
    except Exception:
        return None, None, None, "", "", "", "DateParseUnparseable"
# ==============================================================================
# [8] BODY SIGNATURE EXTRACTION (with caching)
# ==============================================================================
def body_snippet_signature_from_msg(msg) -> str:
    """
    Extract a SHA1 signature from the email body.

    text

    Strategy:
    1. Walk all MIME parts
    2. Skip attachments (Content-Disposition: attachment)
    3. Prefer text/plain parts; fall back to text/html (stripped to text)
    4. Take first BODY_SNIPPET_MATCHING_CHARS characters
    5. Hash with SHA1

    Returns empty string on failure or if no body found.
    """
    snippets: list[tuple[str, str]] = []

    if msg.is_multipart():
        # Walk all parts recursively
        for part in msg.walk():
            ctype = (part.get_content_type() or "").lower()
            disp = (part.get_content_disposition() or "").lower()
            if disp == "attachment":
                continue  # Skip attachments
            if ctype in ("text/plain", "text/html"):
                try:
                    payload = part.get_content()
                except Exception:
                    # Fallback: raw decode
                    try:
                        payload = part.get_payload(decode=True)
                        if isinstance(payload, (bytes, bytearray)):
                            payload = payload.decode(
                                part.get_content_charset() or "utf-8", errors="replace"
                            )
                        else:
                            payload = str(payload)
                    except Exception:
                        payload = ""
                if not payload:
                    continue
                if ctype == "text/html":
                    payload = strip_html_to_text(payload)
                payload = norm_whitespace(payload)
                if payload:
                    snippets.append((ctype, payload))
    else:
        # Single-part message
        ctype = (msg.get_content_type() or "").lower()
        try:
            payload = msg.get_content()
        except Exception:
            try:
                payload = msg.get_payload(decode=True)
                if isinstance(payload, (bytes, bytearray)):
                    payload = payload.decode(
                        msg.get_content_charset() or "utf-8", errors="replace"
                    )
                else:
                    payload = str(payload)
            except Exception:
                payload = ""
        if payload:
            if ctype == "text/html":
                payload = strip_html_to_text(payload)
            payload = norm_whitespace(payload)
            if payload:
                snippets.append((ctype, payload))

    # Prefer text/plain; fall back to first available
    chosen = ""
    for ctype, payload in snippets:
        if ctype == "text/plain":
            chosen = payload
            break
    if not chosen and snippets:
        chosen = snippets[0][1]
    if not chosen:
        return ""

    # Hash the snippet
    chosen = chosen[:BODY_SNIPPET_MATCHING_CHARS]
    return hashlib.sha1(chosen.encode("utf-8", errors="replace")).hexdigest()
def compute_body_sig_for_path(path: str) -> str:
    """
    Compute (or retrieve from cache) the body signature for an .eml file.

    text

    Uses global BODY_SIG_CACHE to avoid re-reading files.
    Respects BODY_SNIPPET_READ_BYTES_LIMIT for memory safety.
    """
    cached = BODY_SIG_CACHE.get(path)
    if cached is not None:
        return cached

    try:
        limit = BODY_SNIPPET_READ_BYTES_LIMIT
        with open(path, "rb") as f:
            raw = f.read(limit) if limit > 0 else f.read()
        # Strip UTF-8 BOM (PowerShell exports) and normalize line endings
        if raw.startswith(b'\xef\xbb\xbf'):
            raw = raw[3:]
        raw = raw.replace(b'\
\
', b'\
').replace(b'\
', b'\
')

        msg = BytesParser(policy=policy.default).parsebytes(raw)
        if not msg.keys():
            logging.warning(f"[BODY_SIG-DEBUG] Available headers: []. Path: {path}")
        sig = body_snippet_signature_from_msg(msg)
    except Exception:
        sig = ""

    BODY_SIG_CACHE[path] = sig
    return sig
# ==============================================================================
# [9] EML PARSING ? FROM FILE TO MsgInfo
# ==============================================================================
def parse_eml(path: str, folder_tag: str, want_body_sig: bool) -> MsgInfo:
    """
    Parse a single .eml file into a MsgInfo dataclass.

    text

    Args:
        path:           Full path to .eml file
        folder_tag:     "A" or "B"
        want_body_sig:  If True, compute body signature (slower but needed for matching)

    Reading strategy:
    - If want_body_sig: read entire file (or up to BODY_SNIPPET_READ_BYTES_LIMIT)
    - Otherwise: read entire file (future: could cap with HEADER_READ_BYTES_LIMIT)
    """
    with open(path, "rb") as f:
        if want_body_sig:
            limit = BODY_SNIPPET_READ_BYTES_LIMIT
            raw = f.read(limit) if limit > 0 else f.read()
        else:
            if HEADER_READ_BYTES_LIMIT and HEADER_READ_BYTES_LIMIT > 0:
                raw = f.read(HEADER_READ_BYTES_LIMIT)
            elif READ_BYTES_LIMIT and READ_BYTES_LIMIT > 0:
                raw = f.read(READ_BYTES_LIMIT)
            else:
                raw = f.read()
    # Fix BOM and line endings before parsing
    if raw.startswith(b'\xef\xbb\xbf'):
        raw = raw[3:]
    raw = raw.replace(b'\
\
', b'\
').replace(b'\
', b'\
')

    msg = BytesParser(policy=policy.default).parsebytes(raw)
    
    if not msg.keys():
        logging.warning(f"[{folder_tag}-DEBUG] Available headers: []. Path: {path} First 200 bytes: {raw[:200]!r}")

    # Extract Message-ID (strip <> brackets)
    message_id = safe_lower(msg.get("Message-ID", "")).strip()
    if message_id.startswith("<") and message_id.endswith(">"):
        message_id = message_id[1:-1].strip()

    # Parse date into three variants
    raw_date = msg.get("Date", "")
    dt_utc, dt_wallclock, dt_compare, utc_min, wall_min, comp_min, date_note = parse_date_to_compare(
        raw_date, folder_tag
    )

    # Subject normalization
    raw_subject = msg.get("Subject", "")
    subj_norm, subj_noprefix = normalize_subject(raw_subject)

    # Address normalization
    from_norm = normalize_addresses(msg.get("From", ""))
    to_norm = normalize_addresses(msg.get("To", ""))
    cc_norm = normalize_addresses(msg.get("Cc", ""))
    bcc_norm = normalize_addresses(msg.get("Bcc", ""))

    # Recipient fallback logic for matching keys:
    # If To is empty (common for Sent Items), use Cc/Bcc
    recipient_for_key_norm = ""
    if to_norm:
        recipient_for_key_norm = to_norm
    else:
        parts = []
        if cc_norm:
            parts.append("Cc:" + cc_norm)
        if bcc_norm:
            parts.append("Bcc:" + bcc_norm)
        recipient_for_key_norm = "|".join(parts)

    recipient_for_key_sig = (
        hashlib.sha1(recipient_for_key_norm.encode("utf-8", errors="replace")).hexdigest()
        if recipient_for_key_norm
        else ""
    )

    to_trunc_for_output = truncate_for_output(to_norm, OUTPUT_RECIPIENT_LIST_MAX_CHARS)

    # Body signature (computed only if requested)
    body_sig = ""
    if want_body_sig:
        try:
            body_sig = body_snippet_signature_from_msg(msg)
        except Exception:
            body_sig = ""

    return MsgInfo(
        folder_tag=folder_tag,
        path=path,
        message_id=message_id,
        date_utc=dt_utc,
        date_wallclock=dt_wallclock,
        date_compare=dt_compare,
        date_utc_minute=utc_min,
        date_wallclock_minute=wall_min,
        date_compare_minute=comp_min,
        subject_norm=subj_norm,
        subject_norm_noprefix=subj_noprefix,
        from_norm=from_norm,
        to_norm=to_norm,
        cc_norm=cc_norm,
        bcc_norm=bcc_norm,
        recipient_for_key_norm=recipient_for_key_norm,
        recipient_for_key_sig=recipient_for_key_sig,
        to_trunc_for_output=to_trunc_for_output,
        body_sig=body_sig,
        date_parse_note=date_note,
    )
# ==============================================================================
# [10] MATCH KEY BUILDERS
# ==============================================================================
# These functions create string keys used for bucketing and matching.
# Format: "YYYY-MM-DD HH:MM|subject_text|optional_extras"
def make_key_date_subject(info: MsgInfo, use_noprefix: bool) -> str:
    """Key = Date_Compare(minute) + Subject."""
    subj = info.subject_norm_noprefix if use_noprefix else info.subject_norm
    return f"{info.date_compare_minute}|{subj}"

def make_key_date_subject_from_recipient_sig(info: MsgInfo, use_noprefix: bool) -> str:
    """
    Strengthened key = Date_Compare + Subject + From + Recipient hash.

    text

    This catches cases where the same person sends the same subject
    to different recipients at the same time.
    """
    subj = info.subject_norm_noprefix if use_noprefix else info.subject_norm
    return f"{info.date_compare_minute}|{subj}|{info.from_norm}|{info.recipient_for_key_sig}"
def minute_key_with_offset_date_compare(info: MsgInfo, minutes_offset: int, use_noprefix: bool) -> str:
    """
    Key with time drift applied.

    text

    Used to match messages where clocks differ by a few minutes.
    """
    if not info.date_compare:
        return ""
    dt2 = info.date_compare + timedelta(minutes=minutes_offset)
    dt2 = dt2.replace(second=0, microsecond=0)
    dt_key = dt2.strftime("%Y-%m-%d %H:%M")
    subj = info.subject_norm_noprefix if use_noprefix else info.subject_norm
    return f"{dt_key}|{subj}"
# ==============================================================================
# [11] CSV AND HYPERLINK UTILITIES
# ==============================================================================
def excel_hyperlink_for_path(path: str, label: str = "Open") -> str:
    """
    Generate an Excel HYPERLINK formula for a file path.

    text

    Excel requires file:/// prefix for local paths and file:// for UNC.
    We convert backslashes to forward slashes for compatibility.
    """
    if not path:
        return ""
    # FIX: Use native Windows path - do NOT percent-encode. Windows does not decode %XX.
    p = path.replace('"', '""')  # escape quotes for Excel formula
    return f'=HYPERLINK("{p}","{label}")'
def export_inventory_csv(infos: list[MsgInfo], out_path: str, folder_tag: str) -> None:
    """
    Write an inventory CSV for a list of MsgInfo objects.

    text

    This is a simple listing ? useful for debugging and manual review.
    """
    with open(out_path, "w", newline="", encoding="utf-8-sig") as f:
        w = csv.writer(f)
        w.writerow(["Key", "Date", "Date_Compare", "Subject", "Filename", "Filename_Link", "DateParseNote"])
        for info in infos:
            key = make_key_date_subject(info, use_noprefix=True)
            w.writerow([
                key,
                info.date_wallclock_minute,
                info.date_compare_minute,
                info.subject_norm_noprefix,
                info.path,
                excel_hyperlink_for_path(info.path, f"Open {folder_tag}"),
                info.date_parse_note,
            ])
# ==============================================================================
# [12] DUPLICATE DETECTION IN FOLDER A
# ==============================================================================
def detect_duplicates_in_a(
    a_infos: list[MsgInfo],
) -> tuple[list[dict], set[str], int, int]:
    """
    Detect duplicate messages in Folder A.

    text

    Strategy:
    1. Bucket by Date_Compare + Subject (+ optional From + Recipient)
    2. Within each bucket, sub-bucket by body signature
    3. Any sub-bucket with >1 message is considered duplicates
    4. First file in each sub-bucket is "original", rest are "duplicates"

    Returns:
        dup_rows:                    List of dicts (for CSV export)
        duplicates_to_exclude:       Set of paths to exclude from matching
        duplicate_subgroup_count:    Number of duplicate groups found
        duplicate_messages_count:    Total messages in duplicate groups
    """
    buckets: dict[str, list[MsgInfo]] = {}

    # Phase 1: Create primary buckets
    for info in a_infos:
        if DUPLICATE_BUCKET_INCLUDE_FROM_AND_RECIPIENT_SIG:
            k = make_key_date_subject_from_recipient_sig(info, use_noprefix=True)
        else:
            k = make_key_date_subject(info, use_noprefix=True)
        buckets.setdefault(k, []).append(info)

    dup_rows: list[dict] = []
    duplicates_to_exclude: set[str] = set()

    subgroup_counter = 0
    duplicate_messages_count = 0

    # Phase 2: Split buckets by body signature
    for k, infos in buckets.items():
        if len(infos) <= 1:
            continue  # No duplicates in this bucket

        sig_map: dict[str, list[MsgInfo]] = {}
        for info in infos:
            sig_key = compute_body_sig_for_path(info.path) if USE_BODY_SNIPPET_MATCHING else "(no_body_sig)"
            sig_map.setdefault(sig_key if sig_key else "(no_body_sig)", []).append(info)

        for sig, sig_infos in sig_map.items():
            if len(sig_infos) <= 1:
                continue  # Same time+subject but different body = not duplicates

            subgroup_counter += 1
            duplicate_messages_count += len(sig_infos)

            # Sort by path for deterministic "original" selection
            sig_infos_sorted = sorted(sig_infos, key=lambda x: x.path.lower())
            original_path = sig_infos_sorted[0].path

            # Mark all but first as duplicates to exclude
            for x in sig_infos_sorted[1:]:
                duplicates_to_exclude.add(x.path)

            # Build CSV rows for reporting
            for info in sig_infos_sorted:
                dup_rows.append({
                    "duplicate_bucket_key": k,
                    "duplicate_subgroup_id": subgroup_counter,
                    "count_in_subgroup": len(sig_infos_sorted),
                    "path": info.path,
                    "path_link": excel_hyperlink_for_path(info.path, "Open"),
                    "is_original_in_subgroup": "1" if info.path == original_path else "0",
                    "message_id": info.message_id,
                    "date_wallclock_minute": info.date_wallclock_minute,
                    "date_compare_minute": info.date_compare_minute,
                    "subject_norm_noprefix": info.subject_norm_noprefix,
                    "from_norm": info.from_norm,
                    "to_norm": info.to_norm,
                    "cc_norm": info.cc_norm,
                    "bcc_norm": info.bcc_norm,
                    "recipient_for_key_norm": truncate_for_output(info.recipient_for_key_norm, OUTPUT_RECIPIENT_LIST_MAX_CHARS),
                    "recipient_for_key_sig": info.recipient_for_key_sig,
                    "body_sig": sig,
                })

    return dup_rows, duplicates_to_exclude, subgroup_counter, duplicate_messages_count
# ==============================================================================
# [13] CACHE HELPERS
# ==============================================================================
# The cache system avoids re-parsing .eml files when nothing has changed.
# Cache keys are derived from file paths and configuration parameters.
def sha1_of_sorted_paths(paths: list[str]) -> str:
    """
    Compute a fingerprint of a file list.

    text

    Sorts paths case-insensitively first, so order doesn't matter.
    Used to detect when files have been added/removed/renamed.
    """
    paths_sorted = sorted(paths, key=lambda x: x.lower())
    h = hashlib.sha1()
    for p in paths_sorted:
        h.update(p.encode("utf-8", errors="ignore"))
        h.update(b"\
")
    return h.hexdigest()
def make_cache_params_hash(params: dict) -> str:
    """
    Hash the configuration parameters that affect parsing.

    text

    If any of these change, the cache is invalidated.
    """
    s = repr(sorted(params.items(), key=lambda kv: kv[0]))
    return hashlib.sha1(s.encode("utf-8", errors="ignore")).hexdigest()[:16]
def load_pickle_if_valid(path: str, expected_paths_fp: str, expected_params: dict):
    """
    Load a cached pickle file if:
    - File exists
    - Schema version matches
    - File list fingerprint matches
    - Parameters hash matches

    text

    Returns None if any check fails (cache is stale or corrupt).
    """
    if not os.path.exists(path):
        return None
    try:
        with open(path, "rb") as f:
            obj = pickle.load(f)
        if not isinstance(obj, dict):
            return None
        if obj.get("schema_version") != CACHE_SCHEMA_VERSION:
            return None
        if obj.get("paths_fp") != expected_paths_fp:
            return None
        if obj.get("params") != expected_params:
            return None
        return obj.get("data")
    except Exception:
        return None
def save_pickle(path: str, payload: dict) -> None:
    """
    Save a dict to a pickle file atomically.

    text

    Uses a .tmp file first, then renames to final name.
    Prevents partial writes corrupting the cache.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(payload, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp, path)
# ==============================================================================
# [14] BEEP + PROMPT + DUPLICATE MOVING
# ==============================================================================
def beep_10_times() -> None:
    """
    Alert the user with sound before the duplicate move prompt.

    text

    Tries winsound (Windows) first, falls back to terminal bell.
    """
    try:
        import winsound  # Windows-specific
        for _ in range(10):
            winsound.Beep(1200, 120)
            time.sleep(0.05)
    except Exception:
        for _ in range(10):
            print("\a", end="", flush=True)  # Terminal bell
            time.sleep(0.15)
        print("")
def prompt_move_duplicates() -> bool:
    """
    Ask the user whether to move detected duplicates.

    text

    Returns True if user confirms (Y/yes), False otherwise.
    """
    beep_10_times()
    while True:
        ans = input("Move Folder A duplicate .eml files (second+ items) to OUTPUT_DIR\\FolderA_Duplicates? (Y/N): ").strip().lower()
        if ans in ("y", "yes"):
            return True
        if ans in ("n", "no", ""):
            return False
        print("Please enter Y or N.")
def move_paths_to_duplicate_folder(paths_to_move: set[str], duplicates_dir: str) -> int:
    """
    Move duplicate files to a quarantine folder.

    text

    Handles filename collisions by appending _dup1, _dup2, etc.

    Returns number of files successfully moved.
    """
    moved = 0
    if not paths_to_move:
        return 0
    ensure_output_dir(duplicates_dir)

    existing_names = set(os.listdir(duplicates_dir))
    for p in sorted(paths_to_move, key=lambda x: x.lower()):
        if not os.path.exists(p):
            continue

        base = os.path.basename(p)
        target = os.path.join(duplicates_dir, base)

        # Handle name collisions
        if os.path.basename(target) in existing_names:
            stem, ext = os.path.splitext(base)
            i = 1
            while True:
                candidate = os.path.join(duplicates_dir, f"{stem}_dup{i}{ext}")
                if os.path.basename(candidate) not in existing_names:
                    target = candidate
                    break
                i += 1

        try:
            shutil.move(p, target)
            moved += 1
            existing_names.add(os.path.basename(target))
        except Exception as e:
            logging.warning(f"Failed to move duplicate file: {p} -> {target} ({e})")

    return moved
# ==============================================================================
# [15] MAIN FUNCTION ? ORCHESTRATES EVERYTHING
# ==============================================================================
def main() -> int:
    """
    Main entry point.

    text

    Workflow:
    1. Setup output directories and logging
    2. Discover .eml files in both folders
    3. Parse Folder B first (build lookup indices)
    4. Parse Folder A (headers only initially)
    5. Export inventory CSVs
    6. Detect duplicates in Folder A (optional move)
    7. Multi-pass matching of A against B
    8. Export results CSV and unmatched B CSV
    9. Print summary

    Returns 0 on success.
    """
    # --- Phase 0: Setup ---
    ensure_output_dir(OUTPUT_DIR)

    log_path = os.path.join(OUTPUT_DIR, LOG_FILENAME)
    csv_path = os.path.join(OUTPUT_DIR, CSV_FILENAME)
    dup_csv_path = os.path.join(OUTPUT_DIR, DUPLICATES_A_CSV_FILENAME)

    inv_a_csv_path = os.path.join(OUTPUT_DIR, INVENTORY_A_CSV_FILENAME)
    inv_b_csv_path = os.path.join(OUTPUT_DIR, INVENTORY_B_CSV_FILENAME)
    unmatched_b_csv_path = os.path.join(OUTPUT_DIR, UNMATCHED_B_CSV_FILENAME)

    setup_logging(log_path)

    # Log configuration
    log_and_print(f"EML Comparison Script Version: {VERSION}")
    log_and_print(f"DRIFT_WINDOW_MINUTES Used: {DRIFT_WINDOW_MINUTES}")
    log_and_print(f"Folder A: {FOLDER_A_PATH}")
    log_and_print(f"Folder B: {FOLDER_B_PATH}")
    log_and_print(f"Output Dir: {OUTPUT_DIR}")
    log_and_print(f"Log File: {log_path}")
    log_and_print(f"CSV File: {csv_path}")
    log_and_print(f"Inventory A CSV: {inv_a_csv_path}")
    log_and_print(f"Inventory B CSV: {inv_b_csv_path}")
    log_and_print(f"Duplicates-A CSV: {dup_csv_path}")
    log_and_print(f"Unmatched B CSV: {unmatched_b_csv_path}")
    log_and_print(f"Max Files A: {'All' if MAX_FILES_TO_PROCESS_A <= 0 else MAX_FILES_TO_PROCESS_A}")
    log_and_print(f"Max Files B: {'All' if MAX_FILES_TO_PROCESS_B <= 0 else MAX_FILES_TO_PROCESS_B}")

    # --- Phase 1: Discover files ---
    a_files = iter_eml_files(FOLDER_A_PATH, MAX_FILES_TO_PROCESS_A)
    b_files = iter_eml_files(FOLDER_B_PATH, MAX_FILES_TO_PROCESS_B)
    log_and_print(f"Discovered .eml files: A={len(a_files)}, B={len(b_files)}")

    # --- Phase 2: Cache setup ---
    cache_dir = os.path.join(OUTPUT_DIR, CACHE_DIR_NAME)
    ensure_output_dir(cache_dir)

    a_paths_fp = sha1_of_sorted_paths(a_files)
    b_paths_fp = sha1_of_sorted_paths(b_files)

    # Configuration parameters that affect parsing (cache key)
    cache_params = {
        "schema_version": CACHE_SCHEMA_VERSION,
        "DATE_COMPARE_A_MODE": DATE_COMPARE_A_MODE,
        "DATE_COMPARE_B_MODE": DATE_COMPARE_B_MODE,
        "USE_BODY_SNIPPET_MATCHING": USE_BODY_SNIPPET_MATCHING,
        "USE_FROM_TO_IN_KEYS": USE_FROM_TO_IN_KEYS,
        "DUPLICATE_BUCKET_INCLUDE_FROM_AND_RECIPIENT_SIG": DUPLICATE_BUCKET_INCLUDE_FROM_AND_RECIPIENT_SIG,
        "BODY_SNIPPET_MATCHING_CHARS": BODY_SNIPPET_MATCHING_CHARS,
        "BODY_SNIPPET_READ_BYTES_LIMIT": BODY_SNIPPET_READ_BYTES_LIMIT,
        "READ_BYTES_LIMIT": READ_BYTES_LIMIT,
        "HEADER_READ_BYTES_LIMIT": HEADER_READ_BYTES_LIMIT,
        "OUTPUT_RECIPIENT_LIST_MAX_CHARS": OUTPUT_RECIPIENT_LIST_MAX_CHARS,
        "USE_METADATA_CACHE": USE_METADATA_CACHE,
    }
    params_hash = make_cache_params_hash(cache_params)

    a_meta_cache_path = os.path.join(cache_dir, f"A_meta_{params_hash}_{a_paths_fp}.pkl")
    b_meta_cache_path = os.path.join(cache_dir, f"B_meta_{params_hash}_{b_paths_fp}.pkl")
    duplicates_cache_path = os.path.join(cache_dir, f"A_dups_{params_hash}_{a_paths_fp}.pkl")

    # Load persisted body signatures for Folder A (speeds up duplicate detection)
    if USE_BODY_SIG_CACHE_PERSIST:
        body_sig_cache_path = os.path.join(cache_dir, f"body_sigs_A_{params_hash}_{a_paths_fp}.pkl")
        if (not CACHE_FORCE_REBUILD) and os.path.exists(body_sig_cache_path):
            try:
                with open(body_sig_cache_path, "rb") as f:
                    obj = pickle.load(f)
                if (
                    isinstance(obj, dict)
                    and obj.get("schema_version") == CACHE_SCHEMA_VERSION
                    and obj.get("paths_fp") == a_paths_fp
                ):
                    persisted = obj.get("data", {})
                    if isinstance(persisted, dict):
                        # CRITICAL: mutate global dict, don't re-bind
                        BODY_SIG_CACHE.clear()
                        BODY_SIG_CACHE.update(persisted)
                        log_and_print(f"Loaded persisted A body signatures: {len(BODY_SIG_CACHE)} entries")
            except Exception:
                pass

    # --- Phase 3: Parse Folder B and build indices ---
    b_infos: list[MsgInfo] = []
    loaded_from_cache_b = False
    use_cache_b = USE_METADATA_CACHE and (not CACHE_FORCE_REBUILD)

    if use_cache_b:
        cached_b = load_pickle_if_valid(b_meta_cache_path, expected_paths_fp=b_paths_fp, expected_params=cache_params)
        if cached_b is not None:
            b_infos = cached_b
            loaded_from_cache_b = True
            log_and_print(f"Loaded Folder B parsed metadata from cache: {len(b_infos)} messages")

    if not b_infos:
        log_and_print("Parsing Folder B and building indices...")
        start_b = time.time()
        b_infos = []
        for i, path in enumerate(b_files, start=1):
            progress_update("[B] Parsing", i, len(b_files), start_b, path)
            if i % 200 == 0 or i == 1 or i == len(b_files):
                logging.info(f"[B] Parsed {i}/{len(b_files)} {int(time.time() - start_b)}s elapsed : {path}")
            info = parse_eml(path, "B", want_body_sig=USE_BODY_SNIPPET_MATCHING)
            b_infos.append(info)

        if USE_METADATA_CACHE:
            payload = {"schema_version": CACHE_SCHEMA_VERSION, "paths_fp": b_paths_fp, "params": cache_params, "data": b_infos}
            try:
                save_pickle(b_meta_cache_path, payload)
                log_and_print(f"Saved Folder B parsed metadata cache: {b_meta_cache_path}")
            except Exception as e:
                log_and_print(f"WARNING: could not save B cache: {e}")

    # Build lookup dictionaries for fast matching
    b_by_mid: dict[str, list[MsgInfo]] = {}
    b_by_ds_noprefix: dict[str, list[MsgInfo]] = {}
    b_by_ds: dict[str, list[MsgInfo]] = {}
    b_by_dsft_noprefix: dict[str, list[MsgInfo]] = {}
    b_by_dsft: dict[str, list[MsgInfo]] = {}
    b_by_body_sig: dict[str, list[MsgInfo]] = {}

    for info in b_infos:
        if info.message_id:
            b_by_mid.setdefault(info.message_id, []).append(info)

        k_nop = make_key_date_subject(info, use_noprefix=True)
        k_full = make_key_date_subject(info, use_noprefix=False)
        b_by_ds_noprefix.setdefault(k_nop, []).append(info)
        b_by_ds.setdefault(k_full, []).append(info)

        if USE_FROM_TO_IN_KEYS:
            k_ft_nop = make_key_date_subject_from_recipient_sig(info, use_noprefix=True)
            k_ft_full = make_key_date_subject_from_recipient_sig(info, use_noprefix=False)
            b_by_dsft_noprefix.setdefault(k_ft_nop, []).append(info)
            b_by_dsft.setdefault(k_ft_full, []).append(info)

        if USE_BODY_SNIPPET_MATCHING and info.body_sig:
            b_by_body_sig.setdefault(info.body_sig, []).append(info)

    log_and_print(
        f"Finished parsing Folder B. Index sizes: "
        f"mid={len(b_by_mid)} "
        f"ds_noprefix={len(b_by_ds_noprefix)} "
        f"body_sig={len(b_by_body_sig)}"
    )

    # --- Phase 4: Parse Folder A (headers only, body sig computed lazily) ---
    a_infos: list[MsgInfo] = []
    loaded_from_cache_a = False
    use_cache_a = USE_METADATA_CACHE and (not CACHE_FORCE_REBUILD)

    if use_cache_a:
        cached_a = load_pickle_if_valid(a_meta_cache_path, expected_paths_fp=a_paths_fp, expected_params=cache_params)
        if cached_a is not None:
            a_infos = cached_a
            loaded_from_cache_a = True
            log_and_print(f"Loaded Folder A parsed metadata from cache: {len(a_infos)} messages")

    if not a_infos:
        log_and_print("Parsing Folder A (headers; body sig computed lazily when needed)...")
        start_a = time.time()
        a_infos = []
        for i, path in enumerate(a_files, start=1):
            progress_update("[A] Parsing", i, len(a_files), start_a, path)
            if i % 200 == 0 or i == 1 or i == len(a_files):
                logging.info(f"[A] Parsed {i}/{len(a_files)} {int(time.time() - start_a)}s elapsed : {path}")
            a_infos.append(parse_eml(path, "A", want_body_sig=False))

        if USE_METADATA_CACHE:
            payload = {"schema_version": CACHE_SCHEMA_VERSION, "paths_fp": a_paths_fp, "params": cache_params, "data": a_infos}
            try:
                save_pickle(a_meta_cache_path, payload)
                log_and_print(f"Saved Folder A parsed metadata cache: {a_meta_cache_path}")
            except Exception as e:
                log_and_print(f"WARNING: could not save A cache: {e}")

    # --- Phase 5: Inventory exports ---
    log_and_print("Writing inventory CSVs for A and B...")
    export_inventory_csv(a_infos, inv_a_csv_path, "A")
    export_inventory_csv(b_infos, inv_b_csv_path, "B")

    # --- Phase 6: Duplicate detection in Folder A ---
    duplicate_paths_in_a: set[str] = set()
    duplicate_subgroup_count = 0
    duplicate_messages_count = 0
    moved_duplicates_count = 0

    dup_rows: list[dict] = []
    duplicates_to_exclude: set[str] = set()
    exclude_duplicates_for_matching = EXCLUDE_DUPLICATES_IN_A_FROM_MATCHING_DEFAULT

    if DETECT_DUPLICATES_IN_A:
        log_and_print("Detecting duplicates in Folder A...")
        loaded_dups = False

        if USE_DUPLICATE_CACHE and (not CACHE_FORCE_REBUILD) and os.path.exists(duplicates_cache_path):
            try:
                with open(duplicates_cache_path, "rb") as f:
                    obj = pickle.load(f)
                if (
                    isinstance(obj, dict)
                    and obj.get("schema_version") == CACHE_SCHEMA_VERSION
                    and obj.get("paths_fp") == a_paths_fp
                    and obj.get("params") == cache_params
                ):
                    dup_rows = obj.get("dup_rows", []) or []
                    duplicates_to_exclude = set(obj.get("duplicates_to_exclude", []) or [])
                    duplicate_subgroup_count = int(obj.get("duplicate_subgroup_count", 0) or 0)
                    duplicate_messages_count = int(obj.get("duplicate_messages_count", 0) or 0)
                    loaded_dups = True
                    log_and_print("Loaded duplicates detection results from cache.")
            except Exception:
                loaded_dups = False

        if not loaded_dups:
            dup_rows, duplicates_to_exclude, duplicate_subgroup_count, duplicate_messages_count = detect_duplicates_in_a(a_infos)

            if USE_DUPLICATE_CACHE:
                try:
                    payload = {
                        "schema_version": CACHE_SCHEMA_VERSION,
                        "paths_fp": a_paths_fp,
                        "params": cache_params,
                        "data": None,
                        "dup_rows": dup_rows,
                        "duplicates_to_exclude": list(duplicates_to_exclude),
                        "duplicate_subgroup_count": duplicate_subgroup_count,
                        "duplicate_messages_count": duplicate_messages_count,
                    }
                    save_pickle(duplicates_cache_path, payload)
                    log_and_print(f"Saved duplicates detection cache: {duplicates_cache_path}")
                except Exception as e:
                    log_and_print(f"WARNING: could not save duplicates cache: {e}")

        # Write duplicates CSV
        with open(dup_csv_path, "w", newline="", encoding="utf-8-sig") as f:
            if dup_rows:
                w = csv.DictWriter(f, fieldnames=list(dup_rows[0].keys()))
                w.writeheader()
                for r in dup_rows:
                    w.writerow(r)
            else:
                csv.writer(f).writerow(["duplicate_bucket_key", "duplicate_subgroup_id", "count_in_subgroup"])

        log_and_print(f"Duplicate groups in A: {duplicate_subgroup_count}; duplicate messages in groups: {duplicate_messages_count}")

        # Prompt user to move duplicates
        if duplicates_to_exclude:
            log_and_print("Duplicate detection completed; about to alert you for the move prompt...")
            do_move = prompt_move_duplicates()
            if do_move:
                exclude_duplicates_for_matching = True
                duplicates_dir = os.path.join(OUTPUT_DIR, "FolderA_Duplicates")
                moved_duplicates_count = move_paths_to_duplicate_folder(duplicates_to_exclude, duplicates_dir)
                log_and_print(f"Moved duplicates to: {duplicates_dir}")
                log_and_print(f"Moved duplicates count: {moved_duplicates_count}")
            else:
                log_and_print("Chose not to move duplicates.")
        else:
            log_and_print("No duplicates detected to move.")

        if exclude_duplicates_for_matching:
            duplicate_paths_in_a = set(duplicates_to_exclude)

    # --- Phase 7: Multi-pass matching of A against B ---
    matched_b_paths: set[str] = set()
    stats = {
        "MatchedByMessageId": 0,
        "MatchedByDateSubjectExact_NoPrefix": 0,
        "MatchedByDateSubjectExact": 0,
        "MatchedByDriftDateSubject": 0,
        "MatchedByFromToStrengthened": 0,
        "MatchedByBodySnippet": 0,
        "UnmatchedA": 0,
        "UnmatchedB": 0,
        "ExcludedDuplicatesInA": 0,
        "DuplicateSubgroupCount": duplicate_subgroup_count,
        "DuplicateMessagesCount": duplicate_messages_count,
        "MovedDuplicatesCount": moved_duplicates_count,
        "LoadedFromCacheA": loaded_from_cache_a,
        "LoadedFromCacheB": loaded_from_cache_b,
    }

    results: list[dict] = []

    # Generate drift offsets: +1, -1, +2, -2, ... up to DRIFT_WINDOW_MINUTES
    offsets: list[int] = []
    for m in range(1, DRIFT_WINDOW_MINUTES + 1):
        offsets.append(m)
        offsets.append(-m)

    def first_unmatched_b(cands: list[MsgInfo]) -> MsgInfo | None:
        """Return first candidate whose path hasn't been matched yet."""
        for c in cands:
            if c.path not in matched_b_paths:
                return c
        return None

    log_and_print("Matching A against B (multi-pass)...")
    start_m = time.time()

    for idx, a in enumerate(a_infos, start=1):
        progress_update("[MATCH]", idx, len(a_infos), start_m, a.path)
        # Skip duplicates if configured
        if a.path in duplicate_paths_in_a:
            stats["ExcludedDuplicatesInA"] += 1
            continue

        if idx % 200 == 0 or idx == 1 or idx == len(a_infos):
            logging.info(f"[MATCH] {idx}/{len(a_infos)} {int(time.time() - start_m)}s elapsed : {a.path}")

        match: MsgInfo | None = None
        match_method = ""
        confidence = 0

        # PASS 1: Exact Message-ID (highest confidence)
        if a.message_id and a.message_id in b_by_mid:
            cand = first_unmatched_b(b_by_mid[a.message_id])
            if cand:
                match = cand
                match_method = "Message-ID"
                confidence = 100

        # PASS 2: Date + Subject exact (NoPrefix)
        if match is None:
            k = make_key_date_subject(a, use_noprefix=True)
            cand_list = b_by_ds_noprefix.get(k)
            if cand_list:
                cand = first_unmatched_b(cand_list)
                if cand:
                    match = cand
                    match_method = "DateSubjectExact(NoPrefix)"
                    confidence = 98

        # PASS 3: Date + Subject exact (full subject)
        if match is None:
            k = make_key_date_subject(a, use_noprefix=False)
            cand_list = b_by_ds.get(k)
            if cand_list:
                cand = first_unmatched_b(cand_list)
                if cand:
                    match = cand
                    match_method = "DateSubjectExact"
                    confidence = 97

        # PASS 4: Drift-tolerant Date + Subject
        if match is None and a.date_compare:
            for off in offsets:
                k = minute_key_with_offset_date_compare(a, off, use_noprefix=True)
                cand_list = b_by_ds_noprefix.get(k)
                if cand_list:
                    cand = first_unmatched_b(cand_list)
                    if cand:
                        match = cand
                        match_method = f"DriftDateSubject({off:+d}m)"
                        # Confidence degrades with larger offset
                        confidence = 95 if abs(off) <= 2 else (90 if abs(off) <= 5 else 85)
                        break

        # PASS 5: From/To strengthened matching
        if match is None and USE_FROM_TO_IN_KEYS:
            k = make_key_date_subject_from_recipient_sig(a, use_noprefix=True)
            cand_list = b_by_dsft_noprefix.get(k)
            if cand_list:
                cand = first_unmatched_b(cand_list)
                if cand:
                    match = cand
                    match_method = "FromToStrengthened(NoPrefix)"
                    confidence = 93

            if match is None:
                k = make_key_date_subject_from_recipient_sig(a, use_noprefix=False)
                cand_list = b_by_dsft.get(k)
                if cand_list:
                    cand = first_unmatched_b(cand_list)
                    if cand:
                        match = cand
                        match_method = "FromToStrengthened"
                        confidence = 92

        # PASS 6: Body snippet matching (lowest confidence, slowest)
        if match is None and USE_BODY_SNIPPET_MATCHING:
            a_sig = compute_body_sig_for_path(a.path)
            if a_sig and a_sig in b_by_body_sig:
                cand = first_unmatched_b(b_by_body_sig[a_sig])
                if cand:
                    match = cand
                    match_method = "BodySnippet"
                    confidence = 70

        # Record match or unmatched
        if match is not None:
            matched_b_paths.add(match.path)

            # Update stats counters
            if match_method == "Message-ID":
                stats["MatchedByMessageId"] += 1
            elif match_method.startswith("DateSubjectExact(NoPrefix)"):
                stats["MatchedByDateSubjectExact_NoPrefix"] += 1
            elif match_method.startswith("DateSubjectExact"):
                stats["MatchedByDateSubjectExact"] += 1
            elif match_method.startswith("DriftDateSubject"):
                stats["MatchedByDriftDateSubject"] += 1
            elif match_method.startswith("FromToStrengthened"):
                stats["MatchedByFromToStrengthened"] += 1
            elif match_method == "BodySnippet":
                stats["MatchedByBodySnippet"] += 1

            results.append({
                "A_Path": a.path,
                "A_Path_Link": excel_hyperlink_for_path(a.path, "Open A"),
                "B_Path": match.path,
                "B_Path_Link": excel_hyperlink_for_path(match.path, "Open B"),
                "MatchStatus": "Matched",
                "MatchMethod": match_method,
                "Confidence": confidence,
                "A_MessageID": a.message_id,
                "A_DateWallclock_Minute": a.date_wallclock_minute,
                "A_Date_Compare_Minute": a.date_compare_minute,
                "A_Subject": a.subject_norm,
                "A_Subject_NoPrefix": a.subject_norm_noprefix,
                "A_From": a.from_norm,
                "A_To": a.to_trunc_for_output,
                "A_Recipient_For_Key_Sig": a.recipient_for_key_sig,
                "B_MessageID": match.message_id,
                "B_DateWallclock_Minute": match.date_wallclock_minute,
                "B_Date_Compare_Minute": match.date_compare_minute,
                "B_Subject": match.subject_norm,
                "B_Subject_NoPrefix": match.subject_norm_noprefix,
                "B_From": match.from_norm,
                "B_To": match.to_trunc_for_output,
                "B_Recipient_For_Key_Sig": match.recipient_for_key_sig,
            })
        else:
            stats["UnmatchedA"] += 1
            results.append({
                "A_Path": a.path,
                "A_Path_Link": excel_hyperlink_for_path(a.path, "Open A"),
                "B_Path": "",
                "B_Path_Link": "",
                "MatchStatus": "Unmatched",
                "MatchMethod": "",
                "Confidence": 0,
                "A_MessageID": a.message_id,
                "A_DateWallclock_Minute": a.date_wallclock_minute,
                "A_Date_Compare_Minute": a.date_compare_minute,
                "A_Subject": a.subject_norm,
                "A_Subject_NoPrefix": a.subject_norm_noprefix,
                "A_From": a.from_norm,
                "A_To": a.to_trunc_for_output,
                "A_Recipient_For_Key_Sig": a.recipient_for_key_sig,
                "B_MessageID": "",
                "B_DateWallclock_Minute": "",
                "B_Date_Compare_Minute": "",
                "B_Subject": "",
                "B_Subject_NoPrefix": "",
                "B_From": "",
                "B_To": "",
                "B_Recipient_For_Key_Sig": "",
            })

    # --- Phase 8: Persist body signature cache ---
    if USE_BODY_SIG_CACHE_PERSIST:
        try:
            body_sig_cache_path = os.path.join(cache_dir, f"body_sigs_A_{params_hash}_{a_paths_fp}.pkl")
            payload = {
                "schema_version": CACHE_SCHEMA_VERSION,
                "paths_fp": a_paths_fp,
                "params": cache_params,
                "data": BODY_SIG_CACHE,
            }
            save_pickle(body_sig_cache_path, payload)
            log_and_print(f"Saved A body signature cache: {len(BODY_SIG_CACHE)} entries -> {body_sig_cache_path}")
        except Exception as e:
            log_and_print(f"WARNING: could not save A body signature cache: {e}")

    # --- Phase 9: Write results CSV ---
    log_and_print("Writing CSV report...")
    fieldnames = list(results[0].keys()) if results else ["A_Path", "A_Path_Link", "B_Path", "B_Path_Link", "MatchStatus", "MatchMethod", "Confidence"]

    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for row in results:
            w.writerow(row)

    # --- Phase 10: Unmatched B CSV ---
    unmatched_b_infos = [info for info in b_infos if info.path not in matched_b_paths]
    stats["UnmatchedB"] = len(unmatched_b_infos)

    log_and_print(f"Writing Unmatched B CSV with {len(unmatched_b_infos)} files...")
    export_inventory_csv(unmatched_b_infos, unmatched_b_csv_path, "B")

    # --- Phase 11: Summary ---
    log_and_print("==== SUMMARY ====")
    log_and_print(f"Script Version Run: {VERSION}")
    log_and_print(f"DRIFT_WINDOW_MINUTES Used: {DRIFT_WINDOW_MINUTES}")
    log_and_print(f"Folder A total discovered: {len(a_files)}")
    log_and_print(f"Folder B total discovered: {len(b_files)}")
    log_and_print(f"Duplicate Subgroups in Folder A: {stats['DuplicateSubgroupCount']}")
    log_and_print(f"Duplicate Messages in Folder A: {stats['DuplicateMessagesCount']}")
    log_and_print(f"Moved duplicates count: {stats['MovedDuplicatesCount']}")
    log_and_print(f"Excluded duplicates from matching: {stats['ExcludedDuplicatesInA']}")
    log_and_print(f"Matched by Message-ID: {stats['MatchedByMessageId']}")
    log_and_print(f"Matched by Date+Subject exact (NoPrefix): {stats['MatchedByDateSubjectExact_NoPrefix']}")
    log_and_print(f"Matched by Date+Subject exact: {stats['MatchedByDateSubjectExact']}")
    log_and_print(f"Matched by Drift Date+Subject: {stats['MatchedByDriftDateSubject']}")
    log_and_print(f"Matched by From/To strengthened: {stats['MatchedByFromToStrengthened']}")
    log_and_print(f"Matched by Body snippet: {stats['MatchedByBodySnippet']}")
    log_and_print(f"Unmatched in A: {stats['UnmatchedA']}")
    log_and_print(f"Unmatched in B (B not matched to A): {stats['UnmatchedB']}")
    log_and_print(f"Unmatched B CSV: {unmatched_b_csv_path}")
    log_and_print(f"Loaded from cache - A: {stats['LoadedFromCacheA']}, B: {stats['LoadedFromCacheB']}")
    log_and_print(f"CSV report: {csv_path}")
    log_and_print(f"Inventory A CSV: {inv_a_csv_path}")
    log_and_print(f"Inventory B CSV: {inv_b_csv_path}")
    log_and_print(f"Duplicates-A CSV: {dup_csv_path}")
    log_and_print(f"Log file: {log_path}")
    log_and_print("Done.")

    return 0
if __name__ == "__main__":
    # Entry point ? ensures clean exit with proper return code
    raise SystemExit(main())

No comments

Windows,

Linux,

Microsoft,

You need to be logged in to comment.

Latest Activity