Multiple SPF records for one domain can cause problems in e-mail delivery

I have found that one of my customers' domains has problems sending messages outside its environment. Some messages got stuck in the queue for several hours or days for no apparent reason. SMTP traffic to most other domains was OK, but some destinations had problems. I suspect the cause was multiple SPF TXT records for a single domain. Example:

domain.com TXT="v=SPF1 mx host1.domain1.com ~all"
domain.com TXT="v=SPF1 mx host2.domain2.com ~all"
domain.com TXT="v=SPF1 mx host3.domain3.com ~all"

RFC 4408 states that a domain must not publish multiple SPF records:

3.1.2.  Multiple DNS Records
   A domain name MUST NOT have multiple records that would cause an
   authorization check to select more than one record.  See Section 4.5
   for the selection rules.

The explanation is quite logical: if more than one SPF record remains after record selection, a permanent error (PermError) is returned.

4.5.  Selecting Records
   Records begin with a version section:
   record           = version terms *SP
   version          = "v=spf1"
   Starting with the set of records that were returned by the lookup,
   record selection proceeds in two steps:
   1. Records that do not begin with a version section of exactly
      "v=spf1" are discarded.  Note that the version section is
      terminated either by an SP character or the end of the record.  A
      record with a version section of "v=spf10" does not match and must
      be discarded.

   2. If any records of type SPF are in the set, then all records of
      type TXT are discarded.
   After the above steps, there should be exactly one record remaining
   and evaluation can proceed.  If there are two or more records
   remaining, then check_host() exits immediately with the result of
   "PermError".
   If no matching records are returned, an SPF client MUST assume that
   the domain makes no SPF declarations.  SPF processing MUST stop and
   return "None".
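
The selection rules quoted above can be sketched in a few lines of Python. This is only an illustration of the two steps, not code from any real SPF library: the function name select_spf_record and the (record type, text) input format are my own assumptions. The version tag is compared case-insensitively, because ABNF string literals are case-insensitive.

```python
from typing import List, Optional, Tuple


def select_spf_record(records: List[Tuple[str, str]]) -> Tuple[str, Optional[str]]:
    """Apply the RFC 4408 section 4.5 selection steps.

    records is a list of (rrtype, text) pairs, where rrtype is "SPF" or "TXT".
    Returns ("None", None), ("PermError", None) or ("Selected", text).
    The input format and the "Selected" label are assumptions of this sketch.
    """
    # Step 1: keep only records whose version section is exactly "v=spf1",
    # terminated by a space or by the end of the record.
    matching = []
    for rrtype, text in records:
        lowered = text.lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            matching.append((rrtype, text))

    # Step 2: if any records of type SPF are present, discard the TXT ones.
    if any(rrtype == "SPF" for rrtype, _ in matching):
        matching = [(rrtype, text) for rrtype, text in matching if rrtype == "SPF"]

    if not matching:
        return ("None", None)        # the domain makes no SPF declarations
    if len(matching) > 1:
        return ("PermError", None)   # more than one record remains
    return ("Selected", matching[0][1])


# The three TXT records from the example at the top of this post.
example = [
    ("TXT", "v=SPF1 mx host1.domain1.com ~all"),
    ("TXT", "v=SPF1 mx host2.domain2.com ~all"),
    ("TXT", "v=SPF1 mx host3.domain3.com ~all"),
]
print(select_spf_record(example)[0])  # PermError
```

With the three records from the example the sketch ends in PermError, which is the case the RFC describes. With only one of them published it would select that record and evaluation could proceed.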

The consequence is that some messages sent from a domain with an invalid SPF setup to a domain that performs SPF checks might be lost (-all) or delayed. I am going to investigate this further. If you have experience with a similar problem, please let me know.

Exchange 2013 SP1 – problem #1 – Powershell virtual directory malfunction – HTTP error (500)

This is a known issue, but I am noting it here as a reminder for future versions: when you start EMS on Exchange 2013 SP1, the connection fails with an HTTP 500 error.

There are three possible root causes. Here are the solutions:

Root cause 1:

The Exchange server's clock is out of sync with the DC. You should always have the following time sync hierarchy in your domain: Reliable time source -> PDC -> Other DCs -> Servers and clients

  • Disable Windows time sync from the physical host if the server is a virtual machine
  • Enable time sync with the domain using the following commands:
  • On the PDC:
net stop w32time 
w32tm /config /syncfromflags:manual /manualpeerlist:0.pool.ntp.org 
w32tm /config /reliable:yes 
net start w32time

On other DCs and Servers:

net stop w32time
w32tm /config /syncfromflags:domhier /reliable:no /update
net start w32time
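
Why clock drift matters here: my assumption is that the failure comes from Kerberos authentication, which by default tolerates only about 5 minutes of difference between the clocks of client and DC. A minimal Python sketch of that check follows. The function name within_kerberos_skew and the sample times are hypothetical, and the 5-minute value is the usual default, which may have been changed in your domain.

```python
from datetime import datetime, timedelta

# Assumption: 5 minutes is the default Kerberos clock-skew tolerance.
MAX_SKEW = timedelta(minutes=5)


def within_kerberos_skew(server_time: datetime, dc_time: datetime,
                         max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(server_time - dc_time) <= max_skew


# Hypothetical readings taken from the DC and the Exchange server.
dc_clock = datetime(2014, 3, 1, 12, 0, 0)
exchange_clock = datetime(2014, 3, 1, 12, 7, 30)
print(within_kerberos_skew(exchange_clock, dc_clock))  # False
```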

Root cause 2:

The path to kerbauth.dll on the Exchange server is wrong, or the PowerShell virtual directory is misconfigured. I re-created the PowerShell virtual directory on the affected server:

Get-PowerShellVirtualDirectory -Server <AffectedServer> | Remove-PowerShellVirtualDirectory
New-PowerShellVirtualDirectory -Server <AffectedServer> -Name PowerShell
Get-PowerShellVirtualDirectory -Server <AffectedServer> | Set-PowerShellVirtualDirectory -BasicAuthentication:$false
IISReset

After re-creating the virtual directory, I checked its modules in IIS and made sure that the Kerberos module is native and that the path to its DLL is correct.

Root cause 3:

The WinRM IIS Extension Windows feature is missing. The full description is here: http://technet.microsoft.com/en-us/library/dd759166.aspx This was the case in my lab, and I believe it is a side effect of an in-place OS upgrade from Windows Server 2012 to Windows Server 2012 R2 on the Exchange server (yes, I know it is not a good idea, but how else do you learn about non-standard issues?). The solution is simple: install this Windows feature:

Get-WindowsFeature *IIS* #to check if it is installed
Add-WindowsFeature Winrm-IIS-Ext # to install

Exchange 2013 SP1, Exchange 2010 SP3 RU5 and Exchange 2007 SP3 RU13 out

You can download them here:

Exchange 2013 SP1: http://www.microsoft.com/en-us/download/details.aspx?id=41994

Exchange 2010 SP3 RU5: http://www.microsoft.com/en-us/download/details.aspx?id=42001

Exchange 2007 SP3 RU13: http://www.microsoft.com/en-us/download/details.aspx?id=41995

RTF content archiving problem when using Mailstore against Exchange 2010 SPx – ErrorInternalServerTransientError

I ran into a problem in one of my customers' Exchange environments after Mailstore archiving software was deployed. Mailstore is an EWS and client based archiving solution for Exchange. All best practice configuration steps can be found here: http://en.help.mailstore.com/MailStore_Help

Environment:

  • Virtualized Exchange 2010 SP3 RUx environment with a 2-node DAG of multirole servers, both running on ESX 5.1. There is no firewall or router between the production Exchange servers and the Mailstore virtual servers.

Symptoms:

  • Messages with RTF content cannot be archived by Mailstore via EWS
  • An RTF message can easily be simulated with a new meeting request containing an inline picture of any size. The meeting request must remain unanswered for the error to appear in 100 percent of cases
  • The error message in the Mailstore job log looks as follows:
08:36:58.874 [18] INFO Processing message: 23.1.2014 7:42:45 UTC 'FW: Problém s archivací meetingů', UID 1: @mail.domain.cz, UID 2: 
08:36:58.890 [18] INFO Retrieving message...
08:36:58.890 [18] INFO Sending EWS Request (GetMimeContent)
08:36:59.561 [18] INFO Sending EWS Request (GetMimeContent)
08:37:00.403 [18] INFO Sending EWS Request (GetMimeContent)
08:37:01.464 [18] INFO Sending EWS Request (GetMimeContent)
08:37:02.727 [18] INFO Sending EWS Request (GetMimeContent)
08:37:04.194 [18] INFO Sending EWS Request (GetMimeContent)
08:37:05.879 [18] INFO Sending EWS Request (GetMimeContent)
08:37:07.751 [18] INFO Sending EWS Request (GetMimeContent)
08:37:09.825 [18] INFO Sending EWS Request (GetMimeContent)
08:37:12.072 [18] INFO Sending EWS Request (GetMimeContent)
08:37:14.521 [18] INFO Sending EWS Request (GetMimeContent)
08:37:17.173 [18] INFO Sending EWS Request (GetMimeContent)
08:37:20.012 [18] INFO Sending EWS Request (GetMimeContent)
08:37:23.070 [18] INFO Sending EWS Request (GetMimeContent)
08:37:26.330 [18] INFO Sending EWS Request (GetMimeContent)
08:37:29.793 [18] INFO Sending EWS Request (GetMimeContent)
08:37:30.230 [18] EXCEPTION MailboxImportWorker:ProcessMailboxMessageWrapper
: Microsoft Exchange Server could not complete the task. Details: An internal server error occurred. Try again later. EWS Error Code: ErrorInternalServerTransientError.
  • Moving the node to another ESX cluster or moving the active database to another node resolved the error immediately, but the error reappeared after switching back
  • User-generated load also contributed to the problem
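
The job log above shows 16 GetMimeContent requests for a single message, and the pause between them grows from roughly 0.7 s to roughly 3.5 s. This looks like a client retrying with an increasing delay before giving up, although that is my reading of the timestamps and not documented Mailstore behaviour. A small Python sketch to measure the gaps from such a log (the function name retry_gaps is my own; the sample lines are copied from the log):

```python
from datetime import datetime


def retry_gaps(log_lines):
    """Return the seconds between consecutive 'Sending EWS Request' lines."""
    stamps = [
        # The first whitespace-separated field is the HH:MM:SS.mmm timestamp.
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in log_lines
        if "Sending EWS Request" in line
    ]
    return [
        round((later - earlier).total_seconds(), 3)
        for earlier, later in zip(stamps, stamps[1:])
    ]


# First four request lines taken from the job log.
sample = [
    "08:36:58.890 [18] INFO Retrieving message...",
    "08:36:58.890 [18] INFO Sending EWS Request (GetMimeContent)",
    "08:36:59.561 [18] INFO Sending EWS Request (GetMimeContent)",
    "08:37:00.403 [18] INFO Sending EWS Request (GetMimeContent)",
    "08:37:01.464 [18] INFO Sending EWS Request (GetMimeContent)",
]
print(retry_gaps(sample))  # [0.671, 0.842, 1.061]
```

A steadily growing gap like this suggests the server keeps answering with a transient error on every attempt, which fits the ErrorInternalServerTransientError at the end of the log.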

Solution:

We tried everything: re-creating throttling policies, moving databases between nodes, updating to the latest RU and Mailstore versions, disabling the TCP Chimney, RSS and AutoTuning features, re-creating the Exchange databases, re-creating the Mailstore database, and many other things.

What finally helped was re-creating the EWS virtual directory and restarting IIS:

Get-WebServicesVirtualDirectory SERVER\ID | Remove-WebServicesVirtualDirectory
New-WebServicesVirtualDirectory
Get-WebServicesVirtualDirectory SERVER\ID | Set-WebServicesVirtualDirectory -InternalURL <IURL> -ExternalURL <EURL>

I suspect one of two things: a problematic IIS 7 metabase, or the use of CGI (Common Gateway Interface, http://technet.microsoft.com/en-us/library/cc753077(v=ws.10).aspx ) on the EWS virtual directory. Uninstalling CGI alone did not solve the problem. The problem was solved by re-creating the EWS virtual directory on the affected DAG node after CGI had been uninstalled.