Processing (Guidelines) Specifications

Processing Options

Below are Insight's processing options and details for you to consider.

De-Duplication

Catalyst can de-duplicate files by custodian, globally against the entire file population, or not at all. Duplicates are identified by hash value. As each file is processed into our database, we compute its hash using the SHA-1 algorithm; files with identical contents produce identical hash values. Files whose hash values match are therefore treated as duplicates. When a duplicate is found, its custodian and hash value are recorded, and the OtherCustodians field is updated on the site when the bulk update that removes the duplicates is run. What we do with the duplicate files depends on our client's requirements, as follows:

Email files are hashed differently: the hash value is computed only from the following fields:
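The file-level matching described in this section can be sketched as follows. This is a minimal illustration, not Catalyst's implementation: the (custodian, name, data) tuple layout, the function names and the scope values "global" and "custodian" are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def sha1_of(data: bytes) -> str:
    """Return the SHA-1 digest of a file's bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


def find_duplicates(files, scope="global"):
    """Split files into originals and duplicates.

    files: iterable of (custodian, name, data) tuples (hypothetical layout).
    scope: "global" compares each file against the whole population;
           "custodian" compares only within the same custodian's files.
    Returns (originals, duplicates, other_custodians).
    """
    seen = {}                            # dedup key -> (custodian, name) of first copy
    other_custodians = defaultdict(set)  # hash -> custodians holding removed copies
    originals, duplicates = [], []
    for custodian, name, data in files:
        digest = sha1_of(data)
        key = digest if scope == "global" else (custodian, digest)
        if key in seen:
            duplicates.append((custodian, name, digest))
            # Record the custodian only when it differs from the original's.
            if custodian != seen[key][0]:
                other_custodians[digest].add(custodian)
        else:
            seen[key] = (custodian, name)
            originals.append((custodian, name, digest))
    return originals, duplicates, other_custodians
```

With global scope, a file held by two custodians is kept once and the second custodian is recorded against its hash; with custodian scope, each custodian keeps one copy.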

Time Zone Processing Options

When we convert email files to HTML for display, we can convert the display times to your preferred time zone setting. We can also adjust the time displayed in the metadata fields to show a preferred time zone for all emails.

By default we convert the time shown in email files to Mountain Time.
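As an illustration of the conversion, the sketch below shifts a timezone-aware timestamp into a display zone using Python's standard zoneinfo module. The function name and the choice of the IANA zone "America/Denver" to stand for Mountain Time are assumptions for the example.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

DEFAULT_ZONE = "America/Denver"  # IANA name assumed here for Mountain Time


def to_display_time(sent: datetime, zone_name: str = DEFAULT_ZONE) -> datetime:
    """Convert a timezone-aware timestamp to the preferred display zone."""
    if sent.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return sent.astimezone(ZoneInfo(zone_name))


# The same UTC hour displays differently in winter and summer
# because the zone observes daylight saving time.
winter = to_display_time(datetime(2023, 1, 15, 18, 0, tzinfo=timezone.utc))
summer = to_display_time(datetime(2023, 7, 15, 18, 0, tzinfo=timezone.utc))
```

Passing a different zone name, for example "America/New_York", gives the display time for another preferred setting.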

Control Numbers: Global or by Custodian

There are three options for numbering. You can number files globally (one prefix across all files), by custodian (one prefix per custodian) or using the Catalyst default (a five-digit job ID plus an eight-digit file identification number, 13 digits total).

Control numbers can be up to 20 characters long and may contain letters and digits only, with no spaces or punctuation. We can also append a static suffix to control numbers.
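The numbering rules can be sketched as below. The function names, the eight-digit sequence width for prefixed numbers, and the assumption that the 20-character limit covers the prefix and suffix as well are all illustrative choices, not documented behavior.

```python
import re

MAX_LENGTH = 20
ALLOWED = re.compile(r"[A-Za-z0-9]+")  # letters and digits only


def default_control_number(job_id: int, file_id: int) -> str:
    """Default scheme: five-digit job ID plus eight-digit file ID."""
    if not (0 <= job_id <= 99_999 and 0 <= file_id <= 99_999_999):
        raise ValueError("job_id or file_id out of range")
    return f"{job_id:05d}{file_id:08d}"


def prefixed_control_number(prefix: str, sequence: int,
                            width: int = 8, suffix: str = "") -> str:
    """Prefix + zero-padded sequence + optional static suffix.

    Use one prefix for global numbering, or one prefix per custodian.
    """
    number = f"{prefix}{sequence:0{width}d}{suffix}"
    if len(number) > MAX_LENGTH or not ALLOWED.fullmatch(number):
        raise ValueError(f"invalid control number: {number!r}")
    return number
```

For example, default_control_number(123, 4567) yields "0012300004567", and prefixed_control_number("SMITH", 42, suffix="X") yields "SMITH00000042X".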

Pre-Filter by File Criteria

At the inventory phase, you can pre-filter files before processing by combinations of fielded information, including custodian, date range (non-email files only), file extension and file type.
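A pre-filter of this kind might look like the sketch below. The InventoryItem fields, the set of extensions treated as email, and the decision to let email pass the date filter untouched are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date

EMAIL_EXTENSIONS = {".msg", ".eml", ".emlx"}  # assumed set of email types


@dataclass
class InventoryItem:
    custodian: str
    extension: str   # lowercase, with leading dot
    modified: date


def passes_prefilter(item, custodians=None, extensions=None,
                     start=None, end=None) -> bool:
    """Return True if the item should go forward to processing."""
    if custodians is not None and item.custodian not in custodians:
        return False
    if extensions is not None and item.extension not in extensions:
        return False
    # Date ranges apply to non-email files only; email is not date-filtered here.
    if item.extension not in EMAIL_EXTENSIONS:
        if start is not None and item.modified < start:
            return False
        if end is not None and item.modified > end:
            return False
    return True
```

Criteria left as None are ignored, so the same function covers any combination of custodian, extension and date range.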

More extensive searches based on metadata and text from individual items can be run once the files are loaded into your site.

Embedded Objects

We can extract embedded files from email and Microsoft Office documents. If the embedded object is a container file, we will extract its contents recursively until all units are extracted.

You can choose not to extract BMP or GIF images, or other display-only objects, from emails. These images will still be displayed as part of the HTML representation of the email. Office documents and other non-image files will be extracted.

Embedded files are shown as children to the parent document unless they are contained in a document that is itself an attachment to another file. In that case, all files will be shown as children (i.e. related/attached) to the ultimate parent.

Email containers (PST, MBOX, NSF) that are embedded will be processed as top-level files.
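The parent and child rules above can be sketched with a small recursive walk. The Item class and function name are hypothetical, file names are assumed unique, and treating an embedded email container as the start of a new family is a simplification of how its contents would actually be handled.

```python
from dataclasses import dataclass, field

EMAIL_CONTAINERS = (".pst", ".mbox", ".nsf")


@dataclass
class Item:
    name: str
    embedded: list = field(default_factory=list)  # directly embedded Items


def assign_parents(top_level: Item) -> dict:
    """Map each file name to its ultimate parent's name (None for top level)."""
    parents = {top_level.name: None}

    def walk(item: Item, ultimate: Item) -> None:
        for child in item.embedded:
            if child.name.lower().endswith(EMAIL_CONTAINERS):
                # Embedded email containers are processed as top-level files.
                parents[child.name] = None
                walk(child, child)
            else:
                # Nested files attach to the ultimate parent, not the
                # intermediate document that contains them.
                parents[child.name] = ultimate.name
                walk(child, ultimate)

    walk(top_level, top_level)
    return parents
```

For an email carrying a Word attachment that itself embeds a spreadsheet, both the attachment and the spreadsheet map to the email.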

NIST File Removal

After initial container files are extracted, Catalyst creates a hash value for every file it has received using the SHA-1 algorithm. The files are then compared to the list of known system and program files in the National Software Reference Library (NSRL), which is maintained by the National Institute of Standards and Technology (NIST). Files that match the NIST list can be removed from the population as system or program files. If a system file is attached to an email or is contained in a ZIP file, a placeholder file will be generated and the system file removed.

Files removed by NIST filtering are not listed on the exception report.
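The comparison against the NIST list can be sketched as a set lookup on SHA-1 digests. The tuple layout, the placeholder text and the function name are assumptions; known_hashes stands in for the NIST list, loaded as lowercase hex digests.

```python
import hashlib

# Hypothetical placeholder content; the real placeholder format is not specified.
PLACEHOLDER = b"File removed: matched the known system or program file list."


def remove_known_files(files, known_hashes):
    """Drop files whose SHA-1 digest appears in known_hashes.

    files: list of (name, data, is_contained) tuples, where is_contained is
           True for files attached to an email or held in a ZIP file.
    Returns (kept, removed_names).
    """
    kept, removed = [], []
    for name, data, is_contained in files:
        digest = hashlib.sha1(data).hexdigest()
        if digest not in known_hashes:
            kept.append((name, data))
            continue
        removed.append(name)
        if is_contained:
            # Keep the family intact by leaving a placeholder in its place.
            kept.append((name, PLACEHOLDER))
    return kept, removed
```

Loose system files simply disappear from the population, while attached or zipped ones are swapped for a placeholder so the parent's attachment list stays complete.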

Other System File Removal

Even if not on the NIST list, we remove files with extensions commonly associated with system files. The extensions we remove when we find them are:

Files removed as system files are not listed on the exception report.

Optional Files to Include or Exclude

You may specify additional files to remove or include during processing.

NOTE: If you specify files to include, no other files will be processed. If you specify files to exclude, the remaining files will be processed.

You can list the file extensions to be included or excluded. Our system analyzes each file to determine its format and bases its action on that analysis rather than just the file extension.
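The include and exclude behavior, together with format analysis that does not rely on the extension alone, can be sketched as follows. The three signatures shown are well-known file headers, but the table is only an illustrative subset; the function names and the fallback to the extension are assumptions rather than a description of our system's actual analysis.

```python
import os

# A few well-known file signatures (illustrative subset only).
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"{\\rtf", ".rtf"),
    (b"\xff\xd8\xff", ".jpg"),
]


def detect_format(name: str, data: bytes) -> str:
    """Prefer the content signature; fall back to the file extension."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return os.path.splitext(name)[1].lower()


def should_process(name: str, data: bytes, include=None, exclude=None) -> bool:
    """Apply an include list or an exclude list to the detected format."""
    fmt = detect_format(name, data)
    if include:
        return fmt in include      # include list: no other files are processed
    if exclude:
        return fmt not in exclude  # exclude list: the remaining files are processed
    return True
```

A PDF that has been renamed to report.txt is still treated as a PDF here, so it is processed under include={".pdf"} and skipped under exclude={".pdf"}.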

Though this list isn’t comprehensive, here are some files to consider including in processing:

.bak, .csv, .doc, .docx, .eml, .emlx, .gzip, .htm, .jpg, .prj, .pps, .ppt, .pptx, .pst, .rar, .rtf, .tar, .txt, .lfp, .mbox, .mdb, .msg, .nsf, .ost, .pdf, .wbk, .wks, .wpd, .xls, .xlsx, .zip

Keep or Remove Native MSG Files

You can choose to keep or remove the native .msg files from the site. We create HTML versions of the email message and of any attachments. By excluding the original .msg file, you can reduce the volume stored on the site.

If you choose to remove the .msg file, you will not be able to view it on the site. The HTML version will instead be treated as the native file.

Hash Email Files by Selected Fields or by Standard HTML Preview

We can change which fields are used to hash email files, or we can create hash values from the content of the HTML preview instead. See the De-Duplication section for the list of the standard email fields that are hashed.
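The two approaches can be contrasted in a short sketch. The field names used as defaults below are placeholders chosen for the example and are not the standard list; the dictionary layout and function names are likewise assumptions.

```python
import hashlib

# Placeholder field selection for illustration only; not the standard list.
EXAMPLE_FIELDS = ("sender", "recipients", "subject", "sent", "body")


def hash_email_fields(message: dict, fields=EXAMPLE_FIELDS) -> str:
    """SHA-1 over the selected fields only, in a fixed order."""
    sha1 = hashlib.sha1()
    for name in fields:
        value = str(message.get(name, "")).strip()
        # Separators keep adjacent fields from running together.
        sha1.update(name.encode("utf-8") + b"\x00" + value.encode("utf-8") + b"\x00")
    return sha1.hexdigest()


def hash_html_preview(html: str) -> str:
    """SHA-1 over the rendered HTML preview of the message."""
    return hashlib.sha1(html.encode("utf-8")).hexdigest()
```

Under field hashing, two copies of a message that differ only in a field outside the selection produce the same hash, so narrowing or widening the field list changes which emails count as duplicates.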