Search and Index Essentials¶
This chapter shows how to customize and extend search operations on PDF document content and metadata, as well as indexing operations. The principal JavaScript objects used in searching and indexing are the search, catalog, and index objects, and the sections below describe how to use them.
Searching for text in PDF documents¶
JavaScript provides a static search object with powerful searching capabilities that can be applied to PDF documents and indexes. Its properties and methods are described in the following tables.
Search properties

| Property | Description |
|---|---|
| attachments | Searches PDF attachments along with the base document. |
| available | Determines if searching is possible. |
| docInfo | Searches document metadata information. |
| docText | Searches document text. |
| docXMP | Searches document XMP metadata. |
| bookmarks | Searches document bookmarks. |
| ignoreAccents | Ignores accents and diacritics in search. |
| ignoreAsianCharacterWidth | Matches Kana characters in query. |
| indexes | Obtains all accessible index objects. |
| jpegExif | Searches EXIF data in associated JPEG images. |
| markup | Searches annotations. |
| matchCase | Determines whether query is case-sensitive. |
| matchWholeWord | Finds only occurrences of complete words. |
| maxDocs | Specifies the maximum number of documents returned. |
| proximity | Uses proximity in results ranking for AND clauses. |
| proximityRange | Specifies the range of proximity search in number of words. |
| refine | Uses previous results in query. |
| stem | Uses word stemming in searches. |
| wordMatching | Determines how words will be matched (phrase, all words, any words, Boolean query). |
Search methods¶
| Method | Description |
|---|---|
| addIndex | Adds an index to the list of searchable indexes. |
| getIndexForPath | Searches the index list according to a specified path. |
| query | Searches the document or index for specified text. |
| removeIndex | Removes an index from the list of searchable indexes. |
Finding words in a PDF document¶
The search object query method is used to search for text within a PDF document. It accepts three parameters:
- cQuery: the search text
- cWhere: where to search:
  - ActiveDoc: within the active document
  - Folder: within a specified folder
  - Index: within a specified index
  - ActiveIndexes: within the active set of available indexes (the default)
- cDIPath: the path to a folder or index used in the search
The simplest type of search is applied to the text within the PDF document. For example, the following code performs a case-insensitive search for the word Acrobat within the current document:
search.query("Acrobat", "ActiveDoc");
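Before issuing a query it can be worth confirming that searching is possible at all, using the available property listed in the properties table. The following is a minimal sketch rather than a definitive implementation: queryActiveDoc is a hypothetical helper name, and the search object is passed in as a parameter so that the logic can also be exercised outside Acrobat.

```javascript
// Sketch: run a query in the active document only when searching is possible.
// `queryActiveDoc` is a hypothetical helper, not part of the Acrobat API.
// `searchObj` is expected to be Acrobat's built-in `search` object.
function queryActiveDoc(searchObj, text) {
    if (!searchObj.available) {
        return false; // searching is not possible, so no query is issued
    }
    searchObj.query(text, "ActiveDoc");
    return true;
}

// In Acrobat: queryActiveDoc(search, "Acrobat");
```

The return value only reports whether a query was issued; it says nothing about whether any matches were found.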
Using advanced search options¶
You can set search object properties to enable advanced search options that control how search strings are matched and whether proximity or stemming is used.
To determine how the words in the search string will be matched, set the search object wordMatching property to one of the following values:
- MatchPhrase: match the exact phrase
- MatchAllWords: match all the words without regard to the order in which they appear
- MatchAnyWord: match any of the words in the search string
- BooleanQuery: perform a Boolean query for multiple-document searches (the default)

For example, the following code matches text containing both “My” and “Search” in either order, such as “My Search” or “Search My”:
search.wordMatching = "MatchAllWords";
search.query("My Search");
To determine whether proximity is used in searches involving multiple documents or index definition files, set the search object wordMatching property to MatchAllWords and set its proximity property to true. In the example below, all instances of the words My and Search that are not separated by more than 900 words (the default value of the proximityRange property) will be listed in the search results:
search.wordMatching = "MatchAllWords";
search.proximity = true;
search.query("My Search");
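The proximityRange property from the properties table can be used to narrow or widen the distance allowed between the words. The sketch below is one possible way to combine the three settings; proximityQuery is a hypothetical helper name, the validation rule is an assumption, and the search object is passed in as a parameter.

```javascript
// Sketch: proximity search with an explicit range in words.
// `proximityQuery` is a hypothetical helper, not part of the Acrobat API.
// `searchObj` is expected to be Acrobat's built-in `search` object.
function proximityQuery(searchObj, text, rangeInWords) {
    // Assumption: the range must be a non-negative whole number of words.
    if (typeof rangeInWords !== "number" || rangeInWords < 0 ||
        Math.floor(rangeInWords) !== rangeInWords) {
        throw new Error("rangeInWords must be a non-negative integer");
    }
    searchObj.wordMatching = "MatchAllWords";
    searchObj.proximity = true;
    searchObj.proximityRange = rangeInWords;
    searchObj.query(text);
}

// In Acrobat: proximityQuery(search, "My Search", 50);
```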
To use stemming in the search, set the search object stem property to true. For example, the following search lists words that share the stem “run”, such as “running” or “runs”:
search.stem = true;
search.query("run");
To specify that the search should only identify occurrences of complete words, set the search object matchWholeWord property to true. For example, the following code matches “nail”, but not “thumbnail” or “nails”:
search.matchWholeWord = true;
search.query("nail");
To set the maximum number of documents to be returned as part of a query, set the search object maxDocs property to the desired number (the default is 100). For example, the following code limits the number of documents returned to 5:
search.maxDocs = 5;
To refine the results of the previous query, set the search object refine property to true, as shown in the following code:
search.refine = true;
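A refined search only makes sense after an earlier query. The sequence might look like the following sketch, which assumes that the results of the first query are in place when the second one is issued. refineInFolder is a hypothetical helper name, and the search object is passed in as a parameter.

```javascript
// Sketch: run a broad folder query, then narrow it with a refined query.
// `refineInFolder` is a hypothetical helper, not part of the Acrobat API.
// `searchObj` is expected to be Acrobat's built-in `search` object.
function refineInFolder(searchObj, firstText, secondText, folderPath) {
    searchObj.refine = false;   // start from a fresh result set
    searchObj.query(firstText, "Folder", folderPath);

    searchObj.refine = true;    // the next query uses the previous results
    searchObj.query(secondText, "Folder", folderPath);

    searchObj.refine = false;   // avoid affecting later, unrelated queries
}

// In Acrobat: refineInFolder(search, "Acrobat", "index", "/C/MyFolder");
```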
Searching across multiple PDF documents¶
This section discusses searches involving more than one PDF document.
Searching all PDF files in a specific location¶
To search all the PDF files within a folder, set the cWhere parameter in the search object query method to Folder. In the following example, all documents in /C/MyFolder will be searched for the word “Acrobat”:
search.query("Acrobat", "Folder", "/C/MyFolder");
Using advanced search options for multiple document searches¶
In addition to the advanced options for matching phrases, using stemming, and using proximity, you can specify whether the search should be case-sensitive, whether it should match whole words only, how many documents may be returned as part of a query, and whether the results of the previous query should be refined.
To specify that a search should be case-sensitive, set the search object matchCase property to true. For example, the following code matches “Acrobat” but not “acrobat”:
search.matchCase = true;
search.query("Acrobat", "Folder", "/C/MyFolder");
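Because search is a static object, a property set for one query stays set for the next. One way to keep several options together is sketched below: the options are applied, the folder query is issued, and the earlier values are put back. This assumes the query reads its settings at the time it is called; queryFolderWithOptions is a hypothetical helper name, and the search object is passed in as a parameter.

```javascript
// Sketch: apply several search options for one folder query, then restore them.
// `queryFolderWithOptions` is a hypothetical helper, not part of the Acrobat API.
// `searchObj` is expected to be Acrobat's built-in `search` object.
function queryFolderWithOptions(searchObj, text, folderPath, options) {
    var saved = {};
    var key;
    for (key in options) {
        if (options.hasOwnProperty(key)) {
            saved[key] = searchObj[key];     // remember the current setting
            searchObj[key] = options[key];   // apply the requested setting
        }
    }
    try {
        searchObj.query(text, "Folder", folderPath);
    } finally {
        for (key in saved) {
            if (saved.hasOwnProperty(key)) {
                searchObj[key] = saved[key]; // restore the earlier setting
            }
        }
    }
}

// In Acrobat:
// queryFolderWithOptions(search, "Acrobat", "/C/MyFolder",
//     { matchCase: true, matchWholeWord: true, maxDocs: 5 });
```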
Searching PDF index files¶
A PDF index file often covers multiple PDF files, and the time required to search an index is much less than that required to search each of the corresponding individual PDF files.
To search a PDF index, set the cWhere parameter in the search object’s query method to Index. In the following example, the index /C/MyIndex.pdx is searched for the word “Acrobat”:
search.query("Acrobat", "Index", "/C/MyIndex.pdx");
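The getIndexForPath and addIndex methods from the methods table can be combined with an index query. The sketch below rests on two assumptions that should be checked against the API reference: that getIndexForPath returns a falsy value when the path is not in the index list, and that addIndex takes the index path followed by a flag that selects the index for searching. queryIndex is a hypothetical helper name, and the search object is passed in as a parameter.

```javascript
// Sketch: make sure an index is in the index list, then search it.
// `queryIndex` is a hypothetical helper, not part of the Acrobat API.
// `searchObj` is expected to be Acrobat's built-in `search` object.
function queryIndex(searchObj, text, indexPath) {
    // Assumption: a falsy result means the path is not in the index list.
    var idx = searchObj.getIndexForPath(indexPath);
    if (!idx) {
        // Assumption: the second argument selects the index for searching.
        idx = searchObj.addIndex(indexPath, true);
    }
    searchObj.query(text, "Index", indexPath);
    return idx;
}

// In Acrobat: queryIndex(search, "Acrobat", "/C/MyIndex.pdx");
```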
Using Boolean queries¶
You can perform a Boolean query when searching multiple documents or index files. Boolean queries use the following operations as logical connectors:
- AND
- OR
- ^ (exclusive or)
- NOT
For example, the phrase "Paris AND France" used in a search would return all documents containing both the words Paris and France.
The phrase "Paris OR France" used in a search would return all documents containing one or both of the words Paris and France.
The phrase "Paris ^ France" used in a search would return all documents containing exactly one (not both) of the words Paris and France.
The phrase "Paris NOT France" used in a search would return all documents containing Paris that do not contain the word France.
In addition, parentheses may be used to group terms. For example, the following query returns all documents that contain the word “Acrobat” and at least one of “Standard”, “Professional”, or “Pro”:
search.wordMatching = "BooleanQuery";
search.query("Acrobat AND (Standard OR Professional OR Pro)", "Folder", "/C/MyFolder");
To specify that a Boolean query will be used, be sure that the search object wordMatching property is set to BooleanQuery (which is the default).
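When the terms come from variables, the query string can be assembled with ordinary string operations. The following sketch builds a query of the form shown above from a list of required words and a list of alternatives. buildBooleanQuery is a hypothetical helper name, and the sketch does not escape or quote the terms it is given.

```javascript
// Sketch: assemble a Boolean query string from required and alternative terms.
// `buildBooleanQuery` is a hypothetical helper, not part of the Acrobat API.
function buildBooleanQuery(required, alternatives) {
    var parts = required.slice(); // copy so the caller's array is not changed
    if (alternatives && alternatives.length > 0) {
        // Group the alternatives in parentheses and join them with OR.
        parts.push("(" + alternatives.join(" OR ") + ")");
    }
    return parts.join(" AND ");
}

var queryText = buildBooleanQuery(["Acrobat"], ["Standard", "Professional", "Pro"]);
// queryText is "Acrobat AND (Standard OR Professional OR Pro)"

// In Acrobat:
// search.wordMatching = "BooleanQuery";
// search.query(queryText, "Folder", "/C/MyFolder");
```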
Indexing multiple PDF documents¶
It is possible to extend and customize indexes for multiple PDF documents using the JavaScript catalog, catalogJob, and index objects. These objects may be used to build, retrieve, or remove indexes. The index object represents a catalog-generated index. It contains a build method that is used to create an index (and returns a catalogJob object containing information about the indexing job), and it has the properties shown in the following table.
Index properties

| Property | Description |
|---|---|
| available | Indicates whether an index is available |
| name | The name of the index |
| path | The device-independent path of the index |
| selected | Indicates whether the index will participate in the search |
The catalog object may be used to manage indexing jobs and retrieve indexes. It contains a getIndex method for retrieving an index, a remove method for removing a pending indexing job, and properties containing information about indexing jobs.
Creating, updating, or rebuilding indexes¶
To determine which indexes are available, use the search object indexes property, which contains an array of index objects. For each object in the array, you can determine its name by using its name property. In the code below, the names and paths of all available selected indexes are printed to the console:
var arr = search.indexes;
for (var i = 0; i < arr.length; i++)
{
    // Report only the indexes that take part in searches
    if (arr[i].selected)
    {
        var str = "Index[" + i + "] = " + arr[i].name;
        str += "\nPath = " + arr[i].path;
        console.println(str);
    }
}
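The available and selected properties from the index properties table can also be used together. The sketch below excludes indexes that are not available from later searches and returns the names of those that remain selected. It assumes that the selected property can be assigned; deselectUnavailable is a hypothetical helper name, and the array of index objects is passed in as a parameter.

```javascript
// Sketch: drop unavailable indexes from the search and list what remains.
// `deselectUnavailable` is a hypothetical helper, not part of the Acrobat API.
// `indexes` is expected to be an array of index objects such as search.indexes.
function deselectUnavailable(indexes) {
    var stillSelected = [];
    for (var i = 0; i < indexes.length; i++) {
        if (!indexes[i].available) {
            // Assumption: `selected` is writable on index objects.
            indexes[i].selected = false;
        }
        if (indexes[i].selected) {
            stillSelected.push(indexes[i].name);
        }
    }
    return stillSelected;
}

// In Acrobat: console.println(deselectUnavailable(search.indexes).join(", "));
```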
To build an index, first invoke the catalog object getIndex method to retrieve the index object. This method accepts a single parameter containing the path of the index. Then invoke the index object build method, which returns a catalogJob object. The build method accepts two parameters:
- cExpr: a JavaScript expression executed once the build operation is complete
- bRebuildAll: indicates whether to perform a clean build in which the existing index is first deleted and then completely rebuilt
Finally, the returned catalogJob object contains three properties providing useful information about the indexing job:
- path: the device-independent path of the index
- type: the type of indexing operation (Build, Rebuild, or Delete)
- status: the status of the indexing operation (Pending, Processing, Completed, or CompletedWithErrors)
In the code shown below, the index myIndex is completely rebuilt, after which its status is reported:
// Retrieve the Index object
var idx = catalog.getIndex("/C/myIndex.pdx");
// Build the index
var job = idx.build("app.alert('Index build');", true);
// Confirm the path of the rebuilt index:
console.println("Path of rebuilt index: " + job.path);
// Confirm that the index was rebuilt:
console.println("Type of operation: " + job.type);
// Report the job status
console.println("Status: " + job.status);
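The status reported immediately after calling build may still be Pending or Processing. The sketch below shows one way to interpret the status values listed above; isJobFinished and describeJob are hypothetical helper names, and the wording of the messages is illustrative only.

```javascript
// Sketch: interpret the status of a catalogJob object.
// `isJobFinished` and `describeJob` are hypothetical helpers,
// not part of the Acrobat API.
function isJobFinished(job) {
    return job.status === "Completed" || job.status === "CompletedWithErrors";
}

function describeJob(job) {
    var text = job.type + " of " + job.path + ": " + job.status;
    if (job.status === "CompletedWithErrors") {
        text += " (check the index for problems)";
    } else if (!isJobFinished(job)) {
        text += " (still in progress)"; // Pending or Processing
    }
    return text;
}

// In Acrobat, with `job` returned by idx.build(...):
// console.println(describeJob(job));
```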
Searching metadata¶
PDF documents contain document metadata in XML format, which includes information such as the document title, subject, author’s name, keywords, copyright information, date modified, file size, and file name and location path.
To use JavaScript to search a document’s XMP metadata, set the search object’s docXMP property to true, as shown in the following code:
search.docXMP = true;
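Setting the property does not itself start a search; a query still has to be issued. The following sketch enables XMP searching and then queries a folder. queryXmpInFolder is a hypothetical helper name, the search text in the usage comment is illustrative, and the search object is passed in as a parameter.

```javascript
// Sketch: include XMP metadata in a folder search.
// `queryXmpInFolder` is a hypothetical helper, not part of the Acrobat API.
// `searchObj` is expected to be Acrobat's built-in `search` object.
function queryXmpInFolder(searchObj, text, folderPath) {
    searchObj.docXMP = true; // search document XMP metadata
    searchObj.query(text, "Folder", folderPath);
}

// In Acrobat: queryXmpInFolder(search, "Adobe", "/C/MyFolder");
```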