CJK Searching Tips

Chinese, Japanese and Korean (CJK) Character Searching

When creating searches for CJK records, remember these few things:

Use BodyText when you want to design the most inclusive search (cast the widest net). The text (body) of a document is tokenized for CJK searching, which means the tokenizer puts spaces between characters where appropriate. Remember that BodyText searching also includes many email fields (for example, Subject, From, and To).
Do not use AnyText searching for CJK characters. While the text of a record is tokenized, the data in most metadata fields is not, so your characters may match non-tokenized characters in a metadata field and return unexpected results. When you type a value directly into the Search box, it is automatically an AnyText search, which searches all of the fields and the body text of the documents at once. This is our fastest, easiest method of searching for a value when using non-CJK characters, but it is not suitable for CJK characters. See our BodyText topic for more information.
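To illustrate why tokenized body text and non-tokenized metadata behave differently, here is a minimal Python sketch. It is not the product's tokenizer, whose rules are not described in this topic. The function name naive_cjk_tokenize and the character ranges are assumptions chosen only for illustration: the sketch simply puts a space around every Han, Hiragana, Katakana, or Hangul character.

```python
import re

# Assumption: a simplified set of CJK ranges (Hiragana, Katakana, common Han
# ideographs, Hangul syllables). A real tokenizer is language-aware and
# considerably more sophisticated than this.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def naive_cjk_tokenize(text: str) -> str:
    """Hypothetical helper: surround each CJK character with spaces,
    then collapse runs of whitespace into single spaces."""
    spaced = CJK_CHAR.sub(lambda m: f" {m.group(0)} ", text)
    return " ".join(spaced.split())


# Body text (tokenized in this sketch): each character becomes its own token.
body_tokens = naive_cjk_tokenize("电脑程序").split()

# Metadata field (not tokenized in this sketch): the whole string is one token.
metadata_tokens = "电脑程序".split()

print(body_tokens)      # ['电', '脑', '程', '序']
print(metadata_tokens)  # ['电脑程序']
```

Under these assumptions, a search for 电脑 can line up with the separate body tokens, while the metadata value remains a single unbroken string, which is one way unexpected AnyText results can arise.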

CJK language searching returns the best results when you select the appropriate language, because CJK languages require tokenization and English does not. By default, the Language menu is set to English unless a different language was requested at the time of site setup. Use the Language menu to select the language of your search values. This tells the system to insert the necessary spaces created by tokenization. See Advanced Settings for more information on search languages.
The Tracked Search mode is very helpful when searching CJK languages, because the queries you build there are BodyText queries. Try searching with wildcards (for example, Begins With or Contains) to explore words that contain some of the characters. See the Tracked Search section for more information.
Proximity searching (also known as near or within searching) is allowed when building CJK queries, but remember not to use AnyText searching. For example, suppose you want to find the Japanese word for "computer" within two words of the word for "program".

"コンピューター" near/2 "プログラム" is an AnyText search, and we discourage building queries in this way.

The better search is BodyText: "コンピューター" near/2 BodyText: "プログラム".

This is also true of Chinese and Korean terms:

"电脑" near/2 "程序"

"컴퓨터" near/2 "프로그램"

These should be constructed as:

BodyText: "电脑" near/2 BodyText: "程序" in Chinese, and

BodyText: "컴퓨터" near/2 BodyText: "프로그램" in Korean.
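If you generate such queries programmatically, the BodyText pattern shown above can be sketched as a small Python helper. The function name bodytext_near is hypothetical and is not part of the product. The sketch only assembles the query string in the form used in this topic, and it rejects terms that contain double quotes because this topic does not describe how quotes are escaped.

```python
def bodytext_near(term_a: str, term_b: str, distance: int) -> str:
    """Hypothetical helper: build a BodyText proximity query string in the
    form shown in this topic, for example
    BodyText: "A" near/2 BodyText: "B"."""
    if distance < 1:
        raise ValueError("distance must be a positive integer")
    for term in (term_a, term_b):
        # Assumption: quote escaping is not covered here, so refuse such terms.
        if not term or '"' in term:
            raise ValueError("terms must be non-empty and contain no double quotes")
    return f'BodyText: "{term_a}" near/{distance} BodyText: "{term_b}"'


print(bodytext_near("コンピューター", "プログラム", 2))
print(bodytext_near("电脑", "程序", 2))
print(bodytext_near("컴퓨터", "프로그램", 2))
```

Prefixing both terms with BodyText keeps the whole proximity query out of AnyText, as recommended above.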

For additional information, refer to Proximity Operators in the Operators topic.