Duplicates are an annoying but not infrequent side effect of the POP3 protocol. Let's discuss how and why they occur.
Avoiding duplicates is fairly easy, but unfortunately many clients fail to implement a robust algorithm. The result is a potential for duplicates nearly any time something unexpected happens, be it a timeout, a client crash, a dropped internet connection, a server shutdown, or some other incident.
A bit of background: each time a POP3 client connects to the server, it knows only the number of messages available and the total size of those messages.
Mail clients download one message at a time using a number assigned to each message. However, there is no guarantee that the order of messages will be consistent from one session to another, especially if messages have been added or deleted. As a result, there is a tendency among mail client authors to just download everything, rather than going through the slightly more complex process of determining which messages are actually new and which are not.
There are three POP3 commands a client can use to keep track of which messages it has downloaded: TOP (to download headers), LAST, and UIDL.
TOP downloads the complete headers of a single message, plus a client-specified number of lines from the body (which may be zero). This is usually enough information for a mail client to determine whether or not it has seen a message before. However, it is a very wasteful technique, especially on an even remotely large mailbox, because the client has to issue one TOP per message in every session.
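As a rough illustration of the TOP approach, the Python sketch below reduces one message's headers to a fingerprint that a client could compare against fingerprints it stored earlier. It assumes the header lines arrive as a list of bytes, which is the shape of the second element returned by poplib's top() method. The function name header_fingerprint and the choice of Message-ID with a SHA-1 fallback are illustrative assumptions, not a description of how any particular client works.

```python
import hashlib
from email.parser import BytesParser


def header_fingerprint(header_lines):
    """Return a string identifying a message by its headers.

    header_lines is a list of bytes lines, e.g. the second element of
    the tuple returned by poplib.POP3.top(msg_num, 0).
    """
    raw = b"\r\n".join(header_lines) + b"\r\n\r\n"
    msg = BytesParser().parsebytes(raw, headersonly=True)
    message_id = msg.get("Message-ID")
    if message_id:
        # Assumption: Message-ID is unique enough to recognise a message.
        return "mid:" + message_id.strip()
    # Fallback for messages without a Message-ID: hash the raw headers.
    return "sha1:" + hashlib.sha1(raw).hexdigest()
```

A client would still have to call TOP once per message to collect these fingerprints, which is exactly the overhead described above.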
Next up is the LAST command. This was once popular, but today I am only aware of a couple of mail clients that still use it. The LAST command allows a client to track the last message downloaded.
Internally, MDaemon's LAST implementation remembers the filename of the most recently downloaded message and works out the appropriate message number each time the command is called. This feature is interesting in that it can help one mail client know whether another mail client has already downloaded a message, while still leaving mail on the server. Not many clients actually use this functionality, though.
The third, and most popular, is the UIDL command, which most clients use today.
The UIDL command lists each message by number, followed by a string which is guaranteed not to change between sessions. This allows the mail client to build an index of messages and easily determine which are new and which were seen before.
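To make the number-versus-identifier distinction concrete, here is a small Python sketch that turns a UIDL listing into an index. It assumes the listing is a list of bytes lines of the form "number unique-id", which is what the second element of poplib's uidl() result looks like. The identifier strings below are made up for illustration and are not MDaemon's actual format.

```python
def parse_uidl(listing):
    """Map message number -> unique-id string from a UIDL listing.

    listing is a list of bytes lines such as b"1 msgA", e.g. the second
    element of the tuple returned by poplib.POP3.uidl().
    """
    index = {}
    for line in listing:
        number, uid = line.decode("ascii").split(None, 1)
        index[int(number)] = uid
    return index


# Hypothetical listings from two sessions. Between them, msgA was
# removed and msgD arrived, so every message number shifted.
session_one = [b"1 msgA", b"2 msgB", b"3 msgC"]
session_two = [b"1 msgB", b"2 msgC", b"3 msgD"]

first = parse_uidl(session_one)
second = parse_uidl(session_two)

seen = set(first.values())
# Only the message whose unique-id was not seen before counts as new.
new_numbers = [n for n, uid in sorted(second.items()) if uid not in seen]
```

Message number 1 refers to a different message in each session, while the unique-id strings still identify the same messages. That is why the index has to be keyed on the unique-id rather than the number.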
MDaemon constructs the UIDL results using the message name, date stamp, size, and a few other details about the messages. As a result, if a message is modified on the server, it will appear as “new” to mail clients even if you don’t rename it.
Armed with an index of messages, it is then up to the client to track which messages it has seen before and which it has not.
So, how does a mail client avoid duplicates? Well, it’s actually pretty simple. Before downloading any messages, issue a UIDL command, compare the results against a local index of messages already seen, and download only the messages that are not in it, noting each one in the index as you go. This data should be maintained even after a mail client has attempted to delete a message, at least until the client has established another POP3 session to confirm the messages were actually deleted.
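The procedure above might be sketched in Python roughly as follows. The conn argument is assumed to behave like a poplib.POP3 object, of which only its uidl() and retr() methods are used. The function name fetch_new and the use of a plain set for the local index are illustrative choices; a real client would also persist the index to disk between sessions.

```python
def fetch_new(conn, seen_uids):
    """Download only messages whose unique-id is not in seen_uids.

    conn is assumed to offer poplib.POP3-style uidl() and retr().
    Returns (messages, updated_seen), where messages is a list of raw
    message bytes and updated_seen is the index for the next session.
    """
    _, listing, _ = conn.uidl()
    on_server = {}
    for line in listing:
        number, uid = line.decode("ascii").split(None, 1)
        on_server[int(number)] = uid

    messages = []
    downloaded = set()
    for number in sorted(on_server):
        uid = on_server[number]
        if uid in seen_uids:
            continue  # already downloaded in an earlier session
        _, lines, _ = conn.retr(number)
        messages.append(b"\r\n".join(lines))
        downloaded.add(uid)

    # Forget a unique-id only once the server no longer lists it, i.e.
    # once a later session has confirmed the message is really gone.
    updated_seen = (seen_uids & set(on_server.values())) | downloaded
    return messages, updated_seen
```

Because a unique-id stays in the index for as long as the server still lists it, a message whose deletion never took effect is recognised and skipped in the next session instead of being downloaded again.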
So you may be asking yourself why any of this is an issue at all. Why not just delete messages as soon as they’re downloaded?
Well, most clients do. If you look at the logs, you’ll often see a DELE command after each message is downloaded. However, if the POP3 session ends in any way other than the client issuing a QUIT command, the POP3 protocol requires the mail server to leave the mailbox in its original state. This means that when a session disconnects unexpectedly, messages that were only marked for deletion are not actually removed, and are still waiting on the server the next time a POP3 client connects.
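The difference between a clean QUIT and a dropped connection can be modeled with a toy example. The Python class below is a hypothetical in-memory stand-in for a single POP3 session, not MDaemon's implementation or a real server. It only mimics the rule that DELE marks a message and that the marks are acted on solely when QUIT is received.

```python
class ToyPop3Session:
    """Hypothetical model of POP3 delete semantics for one session."""

    def __init__(self, mailbox):
        self.mailbox = mailbox  # dict: unique-id -> message bytes
        self.marked = set()     # messages marked by DELE this session

    def dele(self, uid):
        # DELE only marks the message; nothing is removed yet.
        if uid in self.mailbox:
            self.marked.add(uid)

    def quit(self):
        # A clean QUIT is the only point where marked messages go away.
        for uid in self.marked:
            del self.mailbox[uid]
        self.marked.clear()

    def drop(self):
        # Timeout, crash, or disconnect: the marks are discarded and
        # the mailbox is left exactly as it was.
        self.marked.clear()


mailbox = {"a": b"first", "b": b"second"}

session = ToyPop3Session(mailbox)
session.dele("a")
session.drop()
after_drop = sorted(mailbox)  # "a" is still on the server

session = ToyPop3Session(mailbox)
session.dele("a")
session.quit()
after_quit = sorted(mailbox)  # now "a" has been removed
```

A client that forgets message "a" as soon as it sends DELE will download it a second time after the dropped session, which is the duplicate scenario described above.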
As it turns out, in many modern mail clients you can avoid duplicates by enabling “Leave messages on server” together with “Remove mail from server after ‘1’ day”. This combination forces the mail client to maintain a list of downloaded messages between sessions, but still removes messages in short order, preventing mailboxes from getting overwhelmed.