Text Extraction and Indexing Policy

Insight extracts and indexes searchable text from a wide range of file formats, but it cannot extract text in all circumstances. This policy describes Insight’s capability to extract and index text from files loaded into our system. See the Indexing Exceptions table for a full list of the indexing exception codes.

Acceptable File Formats

Insight can extract and index text from over 300 common file and document formats.

In general, Insight can extract and index text from most text-based file formats. For a comprehensive list of supported file formats, see Supported Formats.

As might be expected, Insight cannot extract text from non-text file formats such as image or container files. In addition, we prevent certain other files from being ingested, which allows us to process your data and build your site quickly. To view the excluded files by exception type, click the applicable Indexing Exception report in the Reports module in Insight.

The following is an explanation of each of Insight's indexing exceptions.

Reports Module: Indexing Exceptions Links

When selected, the Reports module displays the available reports, which are linked in the navigation panel. The following is a brief description of each Indexing Exceptions report that can be run in Insight.

 

Insight Indexing Exceptions Reports

 

File not found

Documents where the file to be used for indexing is missing from the upload. To search for files in the “File not found” group of exception codes from the Free Form Search screen, enter the following search syntax into the Free Form search dialogue box: OrbError = [200]. This query returns all documents that match this code.

File too large

Insight will not index files that are deemed too large for efficient indexing. These are often log files, large Excel files, and large reports. Because such files can be detrimental to the performance of Insight’s indexing engine, we do not extract text from files that exceed the size limits given for the corresponding codes in the Indexing Exceptions Code List.

To search for files that fall under the “File too large” error codes, search for (OrbError = [101 102 107 111 112 206]).
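The bracketed search strings in this policy all follow the same pattern, so they can be assembled from a list of codes. The following is a minimal sketch, not part of Insight: the helper name orb_error_query is hypothetical, and it assumes the bracketed form (as in OrbError = [200]) is accepted for any number of codes.

```python
def orb_error_query(codes):
    """Build a Free Form search string such as 'OrbError = [105 108]'.

    Hypothetical helper: it only formats text and does not contact Insight.
    """
    codes = list(codes)
    if not codes:
        raise ValueError("at least one OrbError code is required")
    # Codes are separated by single spaces inside square brackets.
    return "OrbError = [" + " ".join(str(int(c)) for c in codes) + "]"


# Codes listed in this policy for the "File too large" report.
FILE_TOO_LARGE = (101, 102, 107, 111, 112, 206)

print(orb_error_query(FILE_TOO_LARGE))
# prints: OrbError = [101 102 107 111 112 206]
```

The resulting string can be pasted into the Free Form search dialogue box.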

File password protected

This exception covers documents that our text extraction process determined to be password protected, so the text cannot be extracted. To search for these files, search for (OrbError = 103).

No searchable text

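As noted, certain files do not have any text to extract. We classify these documents in two different ways: image or container files with no extractable content (code 105), and files where extraction finished without error but produced no text (code 108).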

To search for files that fall into the “No searchable text” error codes from the Free Form Search screen, search for OrbError = [105 108].

File contains excessive punctuation

Documents that fall under the "File contains excessive punctuation" indexing exception contain excessive punctuation marks as compared to the total number of terms. Examples include documents containing programming code, file listings, or other content with an excessive number of punctuation marks.

To search for files that fall under the “File contains excessive punctuation” error code from the Free Form Search screen, search for (OrbError = 113).

File contains excessive numbers

Documents that fall under the "File contains excessive numbers" indexing exception contain excessive numbers as compared to unique terms. Examples include spreadsheets and other documents containing many numbers.

To search for files that fall under the “File contains excessive numbers” error code from the Free Form Search screen, search for (OrbError = 114).
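To make the two ratio-based exceptions more concrete, the sketch below computes a punctuation-to-terms ratio and a numbers-to-unique-terms ratio for a piece of extracted text. It is an illustration only: the function names, the whitespace tokenization, and the definition of a "number" are assumptions, and this policy does not state the cutoffs Insight applies, so none are included.

```python
import string


def punctuation_ratio(text):
    """Punctuation marks divided by the total number of terms.

    Assumption: terms are whitespace-separated tokens and punctuation
    means the ASCII characters in string.punctuation.
    """
    terms = text.split()
    if not terms:
        return 0.0
    marks = sum(1 for ch in text if ch in string.punctuation)
    return marks / len(terms)


def number_ratio(text):
    """Numeric terms divided by the number of unique terms.

    Assumption: a term counts as numeric when it consists solely of
    digits once surrounding punctuation is stripped.
    """
    terms = text.split()
    unique = set(terms)
    if not unique:
        return 0.0
    numeric = sum(1 for t in terms if t.strip(string.punctuation).isdigit())
    return numeric / len(unique)


print(punctuation_ratio("a, b; c."))  # 3 marks over 3 terms
print(number_ratio("1 2 3 total"))    # 3 numeric terms over 4 unique terms
```

A file listing or a block of source code would score high on the first ratio, and a spreadsheet export would score high on the second.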

Content modified due to size

This link shows documents with a large amount of text where we "shrink" the extracted text by removing certain non-alpha characters and/or duplicate words.

To search for files that fall into the “Content modified due to size” exception group, enter the following search syntax into the search dialogue box: indexissue:shrinkray

ATTN: As of March 22, 2018, this process is no longer used in the US, so the "shrinkray" group of indexing exceptions applies only to documents loaded into Insight in the US before that date. It is still used in Japan.
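As a rough illustration of what shrinking extracted text could look like, the sketch below strips non-alphabetic characters and drops repeated words. It is not Insight's "shrinkray" implementation: the function name is hypothetical, and it assumes that every non-letter character is removed (the policy says only "certain" ones), that only ASCII letters count as letters, and that duplicates are matched case-insensitively.

```python
import re


def shrink_text(text):
    """Remove non-alpha characters, then keep the first occurrence of each word."""
    # Replace anything that is not an ASCII letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    seen = set()
    kept = []
    for word in letters_only.split():
        key = word.lower()  # assumption: "Error" and "error" are duplicates
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)


print(shrink_text("Error 42: error at line 7, line 8!"))
# prints: Error at line
```

The shrunken text keeps one instance of each word, so a keyword search can still match the document while far less text is indexed.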

Other exceptions

This is a catchall category that covers a number of other indexing issues. To search for files that fall under the “Other exceptions” error codes from the Free Form Search screen, search for: OrbError = [100 104 106 109 115 116 117 201 202 203 204] OR OrbError > 206.

Note: OrbError codes 111-117 do not apply to documents loaded in Japan.

To search only for codes 115 through 117 within the "Other exceptions" category from the Free Form Search screen, search for: OrbError = [115 116 117]

Why Insight has exceptions for documents with extractable text

To provide users with a high-performance solution, we identified several classes of problem files where the extracted text contains an excessive number of unique terms or tokens. Often this happens when the text was extracted from an encrypted file, or from a file containing print script, computer code, or even bad OCR.

These files are problematic for Insight’s indexing engine because the extracted text can bloat or corrupt the index and dramatically slow down performance as indexes are updated. They are typically not used for keyword searching because the extracted text is essentially gibberish.

 

Indexing Exceptions Code List

OrbError   Reason
100        Couldn't extract text from this document - zero file length
101        File size exceeds 256MB
102        Extracted text is over 12MB
103        File is password protected
104        Couldn't extract text from this document (non-image files and non-container files)
105        No extractable content (image file or container file)
106        Text removed after four failed submits to Mark Logic
107        Extracted text is over 4MB (or could not be shrunk by "shrinkray" process)
108        No error, but the text extraction resulted in no text getting extracted (aka EmptyBodyText)
109        Text extraction took longer than 30 seconds
111        Extracted text is over 200KB, and the text doesn’t fit into one of the 112-117 categories
112        Extracted text is between 100KB and 200KB, and while it doesn’t fit into one of the 113-117 categories, it is close enough to one of those categories that the text is not allowed through
113        Files with excessive punctuation
114        Files with excessive numbers
115        Files with mostly unique terms that are not typically human generated files
116        Files with base64 strings
117        Files with certain non-Unicode characters that represent unusable text
200        No file to attempt text extraction from
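For users who export OrbError values and want to see which report each code belongs to, the searches given in this policy can be turned into a simple lookup. This is a sketch under stated assumptions: the names REPORT_GROUPS and report_for are hypothetical, and the groupings are copied from the search strings in the report descriptions above, not from any Insight API. Codes that appear in no search string (for example 110) return None.

```python
# Report name -> OrbError codes, taken from the search strings in this policy.
REPORT_GROUPS = {
    "File not found": {200},
    "File too large": {101, 102, 107, 111, 112, 206},
    "File password protected": {103},
    "No searchable text": {105, 108},
    "File contains excessive punctuation": {113},
    "File contains excessive numbers": {114},
    "Other exceptions": {100, 104, 106, 109, 115, 116, 117, 201, 202, 203, 204},
}


def report_for(code):
    """Return the report name for an OrbError code, or None if unlisted."""
    for report, codes in REPORT_GROUPS.items():
        if code in codes:
            return report
    # The "Other exceptions" search also includes: OrbError > 206.
    if code > 206:
        return "Other exceptions"
    return None


print(report_for(103))  # prints: File password protected
print(report_for(300))  # prints: Other exceptions
```

The "Content modified due to size" report is left out because it is searched with indexissue:shrinkray, not with an OrbError code.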