Processing Email Files

We process PST, MBOX and NSF files in English or in other languages. PST files are containers typically used to hold files taken from an Exchange server or from Outlook. MBOX is a generic name for a family of file formats typically containing email (used for many Internet mail formats including Outlook Express). NSF files are containers holding Notes objects including Notes email.

PST files are normally delivered in Unicode. If you have PST or other mail files that are not Unicode-based, please let us know in advance or make sure that you have set up the Automated FTP site to accommodate your needs.

Email DocDate

Catalyst determines the DocDate for each item type as follows:

Email: DocDate = SentDate

Loose Files: DocDate = LastModifiedDate

Calendar: DocDate = StartDate

Outlook Contact: DocDate = LastModifiedDate
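The rules above amount to a simple lookup from item type to a source date field. The following Python sketch illustrates that idea only; the item-type labels and the function name are hypothetical, and it assumes the contact rule refers to the same LastModifiedDate field used for loose files.

```python
from datetime import datetime
from typing import Optional

# Item type -> metadata field used as DocDate, per the rules listed above.
# The keys are hypothetical labels chosen for this illustration.
DOCDATE_FIELD = {
    "email": "SentDate",
    "loose_file": "LastModifiedDate",
    "calendar": "StartDate",
    "contact": "LastModifiedDate",  # assumed equivalent to LastModDate
}


def doc_date(item_type: str, metadata: dict) -> Optional[datetime]:
    """Return the DocDate for an item, or None if the source field is absent."""
    field = DOCDATE_FIELD.get(item_type)
    if field is None:
        raise ValueError(f"unknown item type: {item_type!r}")
    return metadata.get(field)
```

For instance, a calendar item takes its DocDate from StartDate even when a sent date is also present in its metadata.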

Extracting MSG and MBOX Files

Catalyst extracts individual items (email, calendar, contact, tasks, journal entries) and their related metadata from the individual PST or MBOX container. Metadata extracted from individual items includes the following fields:

MailFrom, MailTo, MailCC, MailBCC, MailSubject, Attachments, SentOn, StartDate, FolderFileName, CodePage, Priority, FromDisplay, FromEmail, FromSMTP, MessageType, EntryID, DeliveryTime, AttachmentCount, ConversationIndex, ConversationTopic, Sent (Y/N), CreationDate, ClientSubmitDate, LastModifiedDate, ToDisplay, ToEmail, ToSMTP, CcDisplay, CcEmail, CcSMTP, BccDisplay, BccEmail, BccSMTP, UnRead (Y/N)
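To give a sense of what this extraction involves for a single MBOX-style message, here is a minimal sketch using Python's standard email package. It is not our extraction software. The function name is hypothetical, and the mapping from RFC 5322 headers to a handful of the field names above is an assumption made for illustration.

```python
import email
from email import policy
from email.utils import parsedate_to_datetime


def extract_mail_fields(raw_message: str) -> dict:
    """Map standard mail headers onto a few of the field names listed above."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    date_header = msg["Date"]
    # iter_attachments() yields nothing for a message that is not multipart.
    attachments = [part.get_filename() or "" for part in msg.iter_attachments()]
    return {
        "MailFrom": str(msg["From"] or ""),
        "MailTo": str(msg["To"] or ""),
        "MailCC": str(msg["Cc"] or ""),
        "MailBCC": str(msg["Bcc"] or ""),
        "MailSubject": str(msg["Subject"] or ""),
        "SentOn": parsedate_to_datetime(str(date_header)) if date_header else None,
        "Attachments": attachments,
        "AttachmentCount": len(attachments),
    }
```

Fields that are specific to Exchange or Outlook, such as EntryID or ConversationIndex, have no direct header equivalent and are left out of this sketch.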

Metadata is placed in an individual record, which can be expressed as an XML file or in another standard delimited format. We typically load some but not all of these fields as metadata fields into your site for a document record.
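As an illustration of what an XML-expressed record might look like, the sketch below serializes a dictionary of fields with Python's standard library. The element and attribute names (record, field, id, name) and the function name are hypothetical; the actual record schema is not described here.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record_id: str, fields: dict) -> str:
    """Serialize one metadata record as an XML string (illustrative layout)."""
    root = ET.Element("record", {"id": record_id})
    for name, value in fields.items():
        child = ET.SubElement(root, "field", {"name": name})
        child.text = str(value)  # ElementTree escapes <, > and & on output
    return ET.tostring(root, encoding="unicode")
```

A call such as record_to_xml("DOC-0001", {"MailSubject": "Q3 report", "AttachmentCount": 2}) produces one record element with one field element per entry.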

Attachments to individual MSG or mail files are extracted and processed separately. Metadata relating to each attachment is extracted and placed in a child record which is related to the parent MSG and shown as related/attachment. The record is linked to the extracted attachment item, which could be an Office document, file, MSG or a container file (e.g. ZIP, CAB, RAR, PST).

The processing system is recursive and will continue to extract attachments or files from containers until the unit is no longer divisible. In each case, a separate record is created with whatever metadata can be extracted along with a link to the individual file itself. PST and NSF files are an exception to this rule. They are processed separately as parent-level files.
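The recursion described above can be sketched in a few lines. This Python example handles ZIP containers only, uses a hypothetical Record type and function name, and does not model the PST and NSF exception; it is meant to show the shape of the process, not our processing system.

```python
import io
import zipfile
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    name: str               # file name of the extracted item
    parent: Optional[str]   # name of the container it came from, if any
    size: int               # size in bytes, standing in for extracted metadata


def extract_recursive(name: str, data: bytes, parent: Optional[str] = None,
                      records: Optional[List[Record]] = None) -> List[Record]:
    """Create a record for the item, then recurse into it if it is a ZIP."""
    if records is None:
        records = []
    records.append(Record(name=name, parent=parent, size=len(data)))
    # Check the extension as well: formats such as DOCX are ZIP-based
    # but should be treated as documents rather than containers.
    if name.lower().endswith(".zip") and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                extract_recursive(info.filename, zf.read(info),
                                  parent=name, records=records)
    return records
```

Each nested file ends up with its own record and a pointer back to the container it came from, and the recursion stops once an item is no longer a container.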

Extracting Metadata from Office and Other Document Files

As a matter of course, we extract metadata from the Microsoft Office and other document files we process. Supported formats include files with the following extensions: DOC, XLS, XLT, PPS, PPT, MDB, MPP, POT, VSD, DOCX, XLSX, XLSM, PPTX, PPSX and PDF.

We typically extract the following metadata from Office and other document formats (where available):

Title, Author, Subject, Company, Comments, ApplicationName, Version, Category, Keywords, Manager, LastSavedBy, WordCount, PageCount, ParagraphCount, LineCount, CharacterCount, CharacterCountWithSpaces, ByteCount, PresentationFormat, SlideCount, NoteCount, HiddenSlideCount, MultimediaClipCount, DateCreated, DateLastPrinted, DateLastSaved, TotalEditTime, Template, DocumentSecurity, SharedDocument, RevisionNumber
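For the newer Office formats (DOCX, XLSX, PPTX), several of these properties are stored in a docProps/core.xml part inside the file, which is itself a ZIP package. The sketch below reads a few of them with Python's standard library. It illustrates where the data lives rather than how our extraction works; the function name and the mapping onto the field names above are assumptions.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumed mapping from the field names above to core-property elements.
FIELD_PATHS = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Subject": "dc:subject",
    "Keywords": "cp:keywords",
    "Category": "cp:category",
    "LastSavedBy": "cp:lastModifiedBy",
    "RevisionNumber": "cp:revision",
}


def read_core_properties(source) -> dict:
    """Return the core properties that are present; source is a path or file object."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    result = {}
    for field, path in FIELD_PATHS.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            result[field] = node.text
    return result
```

Properties that are missing from the file are simply omitted from the result, which matches the "where available" qualifier above. Older binary formats such as DOC and XLS store their properties differently and are not covered by this sketch.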

As with emails, we typically only include a few of these possible fields as metadata fields.

Notes (NSF) File Processing

Automated can process Lotus Notes files containing both English and foreign-language content. We developed our own Notes processing software, which uses the internal Notes API to extract content directly rather than relying on a third-party tool to convert Notes emails to Outlook format.

Our practice is to extract all Notes objects from the file by traversing all possible views in the system. We then de-duplicate to remove copies of documents and other objects from the different views. We typically extract the following metadata from NSF files (where available):

MailFrom, MailTo, MailCc, MailBcc, MailSubject, Attachments, SentOn, MessageType, AttachmentCount, ConversationIndex, ConversationTopic, Sent, CreationDate, ClientSubmitDate, LastModifiedDate, StartDate, EntryID, DeliveryTime, FolderFileName

Once processed, Notes attachments and container files are treated in the same fashion as PST files. Date and time zone options are also treated similarly.

We can exclude corrupt views from NSF processing. We also track items by unique ID so we can eliminate duplicate files obtained from processing different NSF views.
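The view traversal and de-duplication described above can be pictured as follows. This Python sketch works on plain dictionaries rather than the Notes API; the data layout (view name mapped to a list of unique-ID and item pairs) and the function name are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def dedupe_views(views: Dict[str, List[Tuple[str, dict]]],
                 corrupt_views: Iterable[str] = ()) -> Dict[str, dict]:
    """Collect items from every usable view, keeping one copy per unique ID."""
    skip: FrozenSet[str] = frozenset(corrupt_views)
    unique: Dict[str, dict] = {}
    for view_name, items in views.items():
        if view_name in skip:
            continue  # excluded view: none of its entries are read
        for unique_id, item in items:
            # The first occurrence wins; later copies from other views are dropped.
            unique.setdefault(unique_id, item)
    return unique
```

An item that appears in both an Inbox view and an All Documents view is therefore kept once, keyed by its unique ID.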

Converting Email Items to HTML

We render each email item (mail, contact, or task) to HTML for viewing. The rendering tries to be faithful to the original formatting, fonts, and colors. However, emails are generated from a data store (database), and their formatting depends on the client viewer. They will look different depending on whether they are displayed in our HTML viewer, in Outlook, or in some other email viewer.

For example, an Outlook file will look different when opened with a different mail viewer such as Outlook Express. Most viewers offer different viewing settings that can be adjusted by the user. Our preview is designed to look similar to how an email might be viewed in Outlook but not necessarily identical.

Our preview for an email message shows the following fields:

From, To, CC, BCC, Subject, Sent Date, Attachments and Body
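As a rough picture of how such a preview could be assembled, the sketch below builds a small HTML fragment from the fields listed above. It is a simplified illustration, not our renderer: the function name and markup are hypothetical, and it shows a plain-text body only, without the original formatting.

```python
from html import escape

# Header fields shown in the preview, in display order.
PREVIEW_FIELDS = ["From", "To", "CC", "BCC", "Subject", "Sent Date", "Attachments"]


def render_preview(item: dict) -> str:
    """Return an HTML fragment with one table row per header field, then the body."""
    rows = []
    for name in PREVIEW_FIELDS:
        value = item.get(name, "")
        # Escape values so that message content cannot inject markup.
        rows.append(f"<tr><th>{escape(name)}</th><td>{escape(str(value))}</td></tr>")
    body = escape(str(item.get("Body", "")))
    return "<table>" + "".join(rows) + "</table>" + f"<pre>{body}</pre>"
```

Missing fields render as empty cells, so every preview has the same layout regardless of which headers a message carries.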