Processing (Guidelines) Specifications
Processing Options
Below are Insight's processing options and details for you to consider.
De-Duplication
Catalyst can de-duplicate files by custodian, globally against the entire file population, or not at all. Duplicate files are identified by hash values, which are computed with the SHA-1 algorithm as files are processed into our database. Hashing is not encryption: it produces a fixed-length fingerprint of a file's content, and files with identical content produce the same hash value. Files whose hash values match are therefore treated as duplicates. When a duplicate is found, its custodian and hash value are recorded, and the OtherCustodians field is updated on the site during the bulk update that removes the duplicates. What we do with the duplicate files depends on your requirements, as follows:
Files de-duped by custodian are removed from your site except for the first copy. At your request, we can provide a cross-linking report.
Files de-duped globally are removed except for the first copy. At your request, we can provide a cross-linking report.
If no de-duplication is selected, a single instance of a document (loose document or email message) is stored on your site. At your request, we can provide a cross-linking report.
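As an illustration only, the by-custodian and global scopes described above can be sketched in Python. This is not Catalyst's implementation: the function name `deduplicate`, the tuple layout of the input, and the shape of the returned values are assumptions made for the example. Only the use of SHA-1 and the idea of recording other custodians come from the text.

```python
import hashlib
from collections import defaultdict


def deduplicate(files, mode="global"):
    """Keep the first copy of each file within the chosen scope.

    files: iterable of (custodian, name, content_bytes) in processing order.
    mode:  "global" compares against the whole population,
           "custodian" compares only within each custodian's files.
    Returns (kept, other_custodians), where other_custodians maps a hash
    to the custodians whose duplicate copies were removed (hypothetical shape).
    """
    if mode not in ("global", "custodian"):
        raise ValueError("mode must be 'global' or 'custodian'")

    first_owner = {}                      # scope key -> custodian of first copy
    kept = []
    other_custodians = defaultdict(set)

    for custodian, name, content in files:
        digest = hashlib.sha1(content).hexdigest()
        key = digest if mode == "global" else (custodian, digest)
        if key in first_owner:
            # Duplicate: record the custodian if it differs from the first copy's.
            if custodian != first_owner[key]:
                other_custodians[digest].add(custodian)
            continue
        first_owner[key] = custodian
        kept.append((custodian, name, digest))

    return kept, dict(other_custodians)


sample = [
    ("alice", "a.doc", b"hello"),
    ("bob", "b.doc", b"hello"),
    ("alice", "c.doc", b"hello"),
    ("bob", "d.doc", b"world"),
]
kept_global, others = deduplicate(sample, mode="global")
kept_by_custodian, _ = deduplicate(sample, mode="custodian")
```

With the sample data, global scope keeps one copy of the shared content and notes the second custodian, while by-custodian scope keeps one copy per custodian.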
Email files are hashed differently: the hash value is computed only from the following fields:
From
To
CC
BCC
Subject
Attachments
Date Sent
Body
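A minimal sketch of field-based email hashing follows, assuming each message is available as a dictionary keyed by the field names listed above. How the fields are normalized and joined before hashing is not stated here, so the serialization below (field name and value separated by NUL bytes, list values joined with semicolons) is purely an assumption, as is the function name `email_hash`.

```python
import hashlib

# Field names taken from the list above.
EMAIL_HASH_FIELDS = (
    "From", "To", "CC", "BCC", "Subject", "Attachments", "Date Sent", "Body",
)


def email_hash(message: dict) -> str:
    """Return a SHA-1 hex digest computed only from the listed fields."""
    h = hashlib.sha1()
    for field in EMAIL_HASH_FIELDS:
        value = message.get(field, "")
        if isinstance(value, (list, tuple)):
            value = ";".join(str(v) for v in value)   # assumed list encoding
        # NUL separators keep adjacent fields from running together.
        h.update(field.encode("utf-8") + b"\x00")
        h.update(str(value).encode("utf-8") + b"\x00")
    return h.hexdigest()


original = {
    "From": "a@example.com",
    "To": ["b@example.com"],
    "Subject": "Quarterly report",
    "Date Sent": "2024-01-15T18:00:00Z",
    "Body": "See attached.",
    "Attachments": ["report.pdf"],
}
# Same message as stored in another mailbox, with an extra non-hashed field.
copy_elsewhere = dict(original, **{"Message-ID": "<123@example.com>"})
edited = dict(original, Subject="Quarterly report (final)")
```

Because only the listed fields feed the hash, `original` and `copy_elsewhere` hash identically, while `edited` does not.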
When we convert email files to HTML for display, we can convert the display times to your preferred time zone setting. We can also adjust the time displayed in the metadata fields to show a preferred time zone for all emails.
By default we convert the time shown in email files to Mountain Time.
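For illustration, a time zone conversion of this kind can be sketched with Python's standard `zoneinfo` module. The sketch assumes sent times are stored in UTC and that "Mountain Time" corresponds to the IANA zone `America/Denver`, which observes daylight saving time; neither detail is stated in this document, and `to_display_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_display_time(sent: datetime, tz_name: str = "America/Denver") -> datetime:
    """Convert a sent time to the preferred display time zone.

    Naive datetimes are treated as UTC (an assumption for this sketch).
    """
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)
    return sent.astimezone(ZoneInfo(tz_name))


winter = to_display_time(datetime(2024, 1, 15, 18, 0, tzinfo=timezone.utc))
summer = to_display_time(datetime(2024, 7, 15, 18, 0, tzinfo=timezone.utc))
eastern = to_display_time(datetime(2024, 1, 15, 18, 0), "America/New_York")
```

Passing a different zone name, as in the last line, models a client's preferred time zone setting.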
Control Numbers: Global or by Custodian
There are three options for numbering. You can number files globally (one prefix across all files), by custodian (one prefix per custodian) or using the Catalyst default (a five-digit job ID plus an eight-digit file identification number, 13 digits total).
For control numbers, we support up to 20 characters (letters or numbers) with no spaces or punctuation. We can also append a static suffix to control numbers.
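The numbering rules above can be sketched as follows. The default format (five-digit job ID plus eight-digit file ID) and the 20-character limit come from the text; the function names, the eight-digit sequence width for prefixed numbers, and the choice to count the suffix toward the 20-character limit are assumptions for the example.

```python
import re

# Up to 20 letters or digits, no spaces or punctuation.
CONTROL_NUMBER_RE = re.compile(r"[A-Za-z0-9]{1,20}")


def default_control_number(job_id: int, file_id: int) -> str:
    """Default format: 5-digit job ID followed by 8-digit file ID."""
    if not (0 <= job_id <= 99_999 and 0 <= file_id <= 99_999_999):
        raise ValueError("job_id or file_id out of range")
    return f"{job_id:05d}{file_id:08d}"


def prefixed_control_number(prefix: str, sequence: int,
                            width: int = 8, suffix: str = "") -> str:
    """Global or per-custodian numbering: prefix + zero-padded sequence + suffix."""
    number = f"{prefix}{sequence:0{width}d}{suffix}"
    if not CONTROL_NUMBER_RE.fullmatch(number):
        raise ValueError(f"invalid control number: {number!r}")
    return number


default_number = default_control_number(12345, 42)
custodian_number = prefixed_control_number("SMITH", 7)
```

A global scheme would pass the same prefix for every file, while a by-custodian scheme would pass one prefix per custodian.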
At the inventory phase, you can pre-filter files before processing by combinations of fielded information, including custodian, date range (non-email files only), file extension and file type.
More extensive searches based on metadata and text from individual items can be run once the files are loaded into your site.
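As a rough sketch of inventory-phase pre-filtering, the example below combines custodian, extension and date-range criteria, and applies the date range only to non-email files as described above. The `InventoryItem` record, its field names, and the use of a last-modified date are hypothetical; the document does not say which date field is used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class InventoryItem:
    custodian: str
    extension: str          # without the leading dot, e.g. "doc"
    modified: date          # assumed date field for the range filter
    is_email: bool = False


def prefilter(items, custodians=None, extensions=None, start=None, end=None):
    """Return the items that pass every supplied criterion."""
    selected = []
    for item in items:
        if custodians is not None and item.custodian not in custodians:
            continue
        if extensions is not None and item.extension.lower() not in extensions:
            continue
        if not item.is_email:               # date ranges skip email files
            if start is not None and item.modified < start:
                continue
            if end is not None and item.modified > end:
                continue
        selected.append(item)
    return selected


inventory = [
    InventoryItem("alice", "doc", date(2020, 6, 1)),
    InventoryItem("alice", "doc", date(2019, 1, 1)),
    InventoryItem("alice", "msg", date(2019, 1, 1), is_email=True),
    InventoryItem("bob", "doc", date(2020, 6, 1)),
    InventoryItem("alice", "exe", date(2020, 6, 1)),
]
selected = prefilter(
    inventory,
    custodians={"alice"},
    extensions={"doc", "msg"},
    start=date(2020, 1, 1),
    end=date(2020, 12, 31),
)
```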
We can extract embedded files from email and Microsoft Office documents. If the embedded object is a container file, we will extract its contents recursively until all units are extracted.
You can choose not to extract bmp or gif images or other objects that are for display only from emails. The images will be displayed as part of the HTML representation of the email. Office documents and other non-image files will be extracted.
Embedded files are shown as children to the parent document unless they are contained in a document that is itself an attachment to another file. In that case, all files will be shown as children (i.e. related/attached) to the ultimate parent.
Email containers (PST, MBOX, NSF) that are embedded will be processed as top-level files.
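The parent and child rule above (every nested item is related to the ultimate parent) can be sketched with a small tree walk. The `Item` class and `flatten_family` function are hypothetical names, the tree is assumed to have been built already by the extraction step, and the special handling of embedded email containers is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    children: list = field(default_factory=list)   # attachments or embedded files


def flatten_family(top: Item):
    """Return (parent, child) pairs linking every descendant to the top-level item."""
    pairs = []
    stack = list(reversed(top.children))
    while stack:
        node = stack.pop()
        pairs.append((top.name, node.name))
        # Descend into nested attachments, embedded objects and containers.
        stack.extend(reversed(node.children))
    return pairs


email = Item("message.msg", [
    Item("report.docx", [Item("chart.xlsx")]),
    Item("photos.zip", [Item("a.txt")]),
])
family = flatten_family(email)
```

Here `chart.xlsx` is embedded in an attachment, so it is reported as a child of `message.msg` rather than of `report.docx`.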
After initial container files are extracted, Catalyst creates a SHA-1 hash value for every file it has received. The files are then compared to the National Software Reference Library (NSRL) list of known system and program files, which is maintained by NIST. Those that match the NIST list can be removed from the population as system or program files. If a system file is attached to an email or is contained in a ZIP file, a placeholder file will be generated and the system file removed.
NIST file removal is not included on the exception report.
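A minimal sketch of hash-based system file removal follows, assuming the reference list has already been loaded as a collection of SHA-1 hex strings. Loading and parsing the actual NIST data set is outside the scope of this example, placeholder generation is not shown, and `remove_known_system_files` is a hypothetical name.

```python
import hashlib


def remove_known_system_files(files, known_sha1):
    """Split files into those kept and those matching the known-file list.

    files:      iterable of (name, content_bytes)
    known_sha1: iterable of SHA-1 hex digests from the reference list
    """
    known = {h.lower() for h in known_sha1}   # hex case varies between sources
    kept, removed = [], []
    for name, content in files:
        digest = hashlib.sha1(content).hexdigest()
        (removed if digest in known else kept).append(name)
    return kept, removed


system_bytes = b"pretend this is a system library"
reference_list = [hashlib.sha1(system_bytes).hexdigest().upper()]
kept, removed = remove_known_system_files(
    [("kernel32.dll", system_bytes), ("memo.txt", b"Meeting at noon")],
    reference_list,
)
```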
Even if not on the NIST list, we remove files with extensions commonly associated with system files. The extensions we remove when we find them are:
ADE—Microsoft Access Project Extension
ADP—Microsoft Access Project
BAS—Visual Basic Class Module
BAT—Batch File
CHM—Compiled HTML Help File
CMD—Windows NT Command Script
COM—MS-DOS Application
CPL—Control Panel Extension
CRT—Security Certificate
DLL—Dynamic Link Library
EXE—Application
HLP—Windows Help File
HTA—HTML Applications
INF—Setup Information File
INS—Internet Communication Settings
ISP—Internet Communication Settings
JS—JScript File
JSE—JScript Encoded Script File
LNK—Shortcut
MSI—Windows Installer Package
MSP—Windows Installer Patch
MST—Visual Test Source File
OCX—ActiveX Objects
PCD—Photo CD Image
PIF—Shortcut to MS-DOS Program
REG—Registration Entries
SCR—Screen Saver
SCT—Windows Script Component
SHB—Document Shortcut File
SHS—Shell Scrap Object
SYS—System Config/Driver
URL—Internet Shortcut (Uniform Resource Locator)
VB—VBScript File
VBE—VBScript Encoded Script File
VBS—VBScript Script File
WSC—Windows Script Component
WSF—Windows Script File
WSH—Windows Scripting Host Settings File
System file removal is not included on the exception report.
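For illustration, the extension list above can be expressed as a simple lookup. The sketch checks only the final extension of the file name, case-insensitively; the helper name `has_system_extension` is an assumption.

```python
from pathlib import PurePath

# Extensions from the list above.
SYSTEM_EXTENSIONS = frozenset("""
    ADE ADP BAS BAT CHM CMD COM CPL CRT DLL EXE HLP HTA INF INS ISP JS JSE
    LNK MSI MSP MST OCX PCD PIF REG SCR SCT SHB SHS SYS URL VB VBE VBS WSC
    WSF WSH
""".split())


def has_system_extension(filename: str) -> bool:
    """True if the file's final extension is on the system-file list."""
    ext = PurePath(filename).suffix.lstrip(".").upper()
    return ext in SYSTEM_EXTENSIONS


names = ["setup.EXE", "report.docx", "run.bat", "README"]
flagged = [n for n in names if has_system_extension(n)]
```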
Optional Files to Include or Exclude
You may specify additional files to remove or include during processing.
NOTE: If you specify files to include, no other files will be processed. If you specify files to exclude, the remaining files will be processed.
You can list the file extensions to be included or excluded. Our system analyzes each file to determine its format and bases its action on that analysis rather than just the file extension.
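The include and exclude behavior can be sketched as below. Format detection here is a deliberately small stand-in that checks two well-known file signatures (PDF and RTF) and otherwise falls back to the extension; real format analysis covers far more types, and the function names are hypothetical. Giving an include list precedence when both lists are supplied is also an assumption.

```python
# Leading byte signatures for two common formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),
]


def detect_format(content: bytes, extension: str) -> str:
    """Prefer the content signature; fall back to the file extension."""
    for signature, fmt in SIGNATURES:
        if content.startswith(signature):
            return fmt
    return extension.lower().lstrip(".")


def should_process(content: bytes, extension: str,
                   include=None, exclude=None) -> bool:
    """Apply an include list (only these) or an exclude list (all but these)."""
    fmt = detect_format(content, extension)
    if include is not None:
        return fmt in include        # nothing outside the list is processed
    if exclude is not None:
        return fmt not in exclude    # everything else is processed
    return True


pdf_named_txt = (b"%PDF-1.7 sample", ".txt")
plain_text = (b"hello", ".txt")
```

A PDF renamed to `.txt` is still treated as a PDF, which mirrors the point above that the action is based on the analyzed format rather than just the extension.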
Though this list isn’t comprehensive, here are some files to consider including in processing:
.bak, .csv, .doc, .docx, .eml, .emlx, .gzip, .htm, .jpg, .prj, .pps, .ppt, .pptx, .pst, .rar, .rtf, .tar, .txt, .lfp, .mbox, .mdb, .msg, .nsf, .ost, .pdf, .wbk, .wks, .wpd, .xls, .xlsx, .zip
Keep or Remove Native MSG Files
You can choose to keep or remove the native .msg files from the site. We create HTML versions of the email message and of any attachments. By excluding the original .msg file, you can reduce the volume being stored on the site.
If you choose to remove the .msg file, you will not be able to view it on the site. The HTML version will instead be treated as the native file.
Hash Email Files by Selected Fields or by Standard HTML Preview
We can limit email hashing to the fields you select, or we can create hash values from the content of the HTML preview. See the De-Duplication section for the list of the standard email fields that are hashed.