Document Format Guidelines

Native Files

Catalyst will attempt to index the text in a native file, so no text file is required. If this indexing is successful, search hit highlighting will be available on the preview of the native.

Compressed files must be extracted and processed prior to uploading. System and container files cannot be indexed or viewed through the site.

Metadata should be extracted from the native files and submitted with a corresponding load file.

OCR or Text Files

When delivering files that are searchable in the native format (.DOC, .XLS, etc.), extracted text files are not required because Catalyst will index the native file. If the files do not contain any extractable text (.TIF, .JPG, etc.), then OCR text files are needed in order to make the documents searchable within Insight. Requirements for delivering text files are as follows:

OCR or extracted text must be submitted as separate text files. Load files should not contain OCR or extracted text (i.e., text should not be submitted in a field within a Concordance DAT file).
Text files must be in multi-page format (one text file per document, not one text file per page). Catalyst cannot accept single-page text files.
Page breaks in the text files are preferred but not necessary.
Text files must have UTF-8 encoding to ensure proper indexing.
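
The UTF-8 requirement above can be checked before delivery. The following is a minimal sketch, not a Catalyst tool: the function name find_non_utf8 and the assumption that text files carry a .txt extension (in any letter case) are illustrative only.

```python
from pathlib import Path


def find_non_utf8(folder):
    """Return the .txt files under `folder` that do not decode as UTF-8.

    Hypothetical helper: the name and the .txt-only filter are assumptions,
    not part of any Catalyst specification.
    """
    bad = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".txt":
            try:
                # Strict decoding raises on any byte sequence that is not valid UTF-8.
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return bad
```

Calling find_non_utf8 on a delivery folder returns the files that should be re-encoded before upload. Note that plain ASCII files pass, because ASCII is a subset of UTF-8.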

TIFF Files

Catalyst Insight accepts single-page TIFF files. For TIFF files to be searchable, OCR text files must also be delivered. Single-page TIFF files must be loaded manually and cannot be loaded via the Automated system. These files must be accompanied by an additional load file, either an IPRO .LFP file or an Opticon .OPT file, that indicates the document breaks. Multi-page text files must be delivered with the single-page TIFF files (single-page text files cannot be loaded). Each text file should be named to match the first page of its document, such as ABC001.TIF.TXT. If a text file name does not consist of the full TIFF file name (including the .TIF extension) plus .TXT, the file will be indexed but not visible on the site. Regardless of file naming convention, the text files must be delivered within the same folder as the image files. The associated text should not be included within the load file.
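
The naming rule for text files can be checked with a short script. This is a rough sketch under stated assumptions: the function name find_misnamed_text_files is hypothetical, and file names are compared case-insensitively, which the guidelines do not confirm. It only checks that each text file is named after a TIFF in the same folder; confirming that the TIFF is the first page of a document would require reading the .LFP or .OPT load file, which this sketch does not do.

```python
from pathlib import Path


def find_misnamed_text_files(folder):
    """Return .TXT file names that are not '<existing TIFF file name>.TXT'.

    Hypothetical helper. Case-insensitive comparison is an assumption.
    """
    # Map upper-cased name -> original name for every file in the folder.
    files = {p.name.upper(): p.name for p in Path(folder).iterdir() if p.is_file()}
    misnamed = []
    for upper_name, name in sorted(files.items()):
        if not upper_name.endswith(".TXT"):
            continue
        image_name = upper_name[: -len(".TXT")]  # e.g. "ABC001.TIF"
        # The text file must wrap a .TIF name, and that image must sit in the same folder.
        if not image_name.endswith(".TIF") or image_name not in files:
            misnamed.append(name)
    return misnamed
```

For a folder holding ABC001.TIF, ABC002.TIF, ABC001.TIF.TXT, and ABC003.TXT, the function reports only ABC003.TXT, since it lacks the .TIF portion of the name.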

PDF Files

All PDF files must be optimized for fast web viewing or “linearized.” There are three types of PDF files, with unique instructions for each:

PDFs with embedded text: This is Catalyst's preferred format for images. These PDFs contain embedded text and are created either from scanned images run through an OCR process or from an electronic source file.
PDFs with associated OCR text files: In this format, the OCR text is delivered in a separate file with the same name as the PDF file (ABC001.PDF and ABC001.TXT, for example).
PDFs with embedded images only: PDFs without embedded text or associated OCR text files will not have searchable text. Users will have to rely on metadata searches to find these documents in the repository.
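
A quick pre-delivery check for the linearization requirement might look like the sketch below. It relies on the PDF specification's rule that a linearized file carries its linearization parameter dictionary (containing the /Linearized key) within the first 1024 bytes. The function name looks_linearized is hypothetical, and the check is only a heuristic: it does not validate hint tables, and a file altered after linearization may still carry the marker.

```python
def looks_linearized(pdf_path):
    """Heuristically report whether a PDF appears to be linearized.

    Hypothetical helper: looks for the /Linearized key in the first
    1024 bytes, where the PDF specification requires the linearization
    parameter dictionary to appear. Not a full validation.
    """
    with open(pdf_path, "rb") as f:
        head = f.read(1024)
    return head.startswith(b"%PDF-") and b"/Linearized" in head
```

Files for which this returns False are candidates for re-saving with a "fast web view" or "linearize" option in whatever PDF tool is in use.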

Multi-Language Documents

Multi-language documents must also be in UTF-8 format.

Delivery of Coding for Documents

Specific formatting requirements apply when delivering document coding for fields whose data will be mapped to radio buttons, checkboxes, drop-down lists, or multi-select fields. These requirements apply only to editable fields:

The values in the data should EXACTLY match the facet values for the fields on the site. This includes matching case and punctuation.
If a value doesn’t match one of the existing search facets, it may or may not display on the site. The new value will not be added to the field and may cause problems in searching on that field.
Multi-valued data must use the semicolon to separate values within a field.
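
The coding requirements above can be checked before delivery with a sketch like the following. The function name find_unmatched_values and the example facet values are hypothetical. The comparison is deliberately strict (case, punctuation, and surrounding whitespace all count); whether the loader trims spaces around the semicolon is not stated here, so the sketch flags them rather than guessing.

```python
def find_unmatched_values(field_value, facet_values, separator=";"):
    """Return the values in a delimited coding field that match no facet exactly.

    Hypothetical helper. Matching is exact: case, punctuation, and
    whitespace are all significant. Empty entries are ignored.
    """
    allowed = set(facet_values)
    values = [v for v in field_value.split(separator) if v != ""]
    return [v for v in values if v not in allowed]


# Example with made-up facet values for a multi-select field.
facets = ["Privileged", "Hot", "Not Responsive"]
print(find_unmatched_values("Privileged;hot", facets))
```

Here "hot" is reported because its case differs from the facet "Hot". An empty result means every value in the field matches an existing facet.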