Layer: PD_Layer
Object: PDWordFinder

A PDWordFinder extracts words from a PDF file and enumerates the words on a single page or on all pages in a document. The core API provides methods to extract words from a document, to obtain information about the word finder, and to release a word list once a plug-in is done using it.

To create a word finder, use PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDDocCreateWordFinderEx() is a version 6.0 replacement for PDDocCreateWordFinder() and PDDocCreateWordFinderUCS() that adds configurable word-breaking behavior.

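There are two primary methods of using word finders:

  • To obtain a table of all words on a page, call PDWordFinderAcquireWordList() or PDWordFinderAcquireVisibleWordList(), and release the list with PDWordFinderReleaseWordList() when you are done with it.

  • To have a user-supplied procedure called once for each word found, call PDWordFinderEnumWords() or PDWordFinderEnumVisibleWords().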

Define Summary
 Define
 WF_LATEST_VERSION
Used to obtain the latest available version.
Typedef Summary
 Typedef
 PDWordFinder
Extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document.
 PDWordFinderConfig
 PDWordFinderConfigRec
Structure Summary
 Structure
 _t_PDWordFinderConfig
A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.
Callback Summary
 Callback
 PDWordFinderCtrlProc
This is passed to PDWordFinderSetCtrlProc().
Method Summary
 Method
 
void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)
Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
 
void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)
Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
 
void PDWordFinderDestroy(PDWordFinder wObj)
Destroys a word finder. Use this when you are done extracting text in a file.
 
ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext, PDWordProc wordProc, void* clientData)
Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
 
ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void* clientData)
Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
 
ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val* ucsStr, ASUns32 strLen, ASUns32 charOffsetAdj, PDWordProc wordProc, void* clientData)
Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.
 
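ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj)
Gets the version number of the specified word finder, or the version number of the latest word finder algorithm.
 
PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh)
Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().
 
void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum)
Releases the word list for a given page. Use this to release a list created by PDWordFinderAcquireWordList() when you are done using this list.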
Defines Detail
WF_LATEST_VERSION 
Product availability: All
Platform availability: All

Syntax

#define WF_LATEST_VERSION 0

Description

Used to obtain the latest available version.


File: PDExpT.h
Line: 3562

Typedefs Detail
PDWordFinder 
Product availability: All
Platform availability: All

Syntax

typedef struct _t_PDWordFinder* PDWordFinder;

Extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document.


File: PDExpT.h
Line: 3294
PDWordFinderConfig 
Product availability: All
Platform availability: All

Syntax

typedef _t_PDWordFinderConfig PDWordFinderConfig;

File: PDExpT.h
Line: 3806
PDWordFinderConfigRec 
Product availability: All
Platform availability: All

Syntax

typedef _t_PDWordFinderConfig PDWordFinderConfigRec;

File: PDExpT.h
Line: 3806


Structure Detail
_t_PDWordFinderConfig
Product availability: All
Platform availability: All

Syntax

A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.


File: PDExpT.h
Line: 3607

Elements
recSize  

This is always sizeof(PDWordFinderConfigRec).

 
disableTaggedPDF  

When true, it disables tagged PDF support and treats the document as non-tagged PDF. Use this to keep the word finder in legacy mode when it is created with the latest algorithm version (WF_LATEST_VERSION).

 
noXYSort  

When true, it disables generating an XY-ordered word list. This option replaces the sort order flags in the older version of the word finder creation command (PDDocCreateWordFinder()). Setting this option is equivalent to omitting the WXE_XY_SORT flag.

 
preserveSpaces  

When true, the word finder preserves space characters during word breaking. Otherwise, spaces are removed from output text. When false (the default), you can add spaces later by considering the word attribute flag WXE_ADJACENT_TO_SPACE, but there is no way to restore the exact number of consecutive space characters.

 
noLigatureExp  

When true, and the font has a ToUnicode table, it disables the expansion of ligatures using the default ligatures. The default ligatures are:

  • fi

  • ff

  • fl

  • ffi

  • ffl

  • st

  • oe

  • OE

When noLigatureExp is true and the font does not have a ToUnicode table, the ligature is expanded based on whether there is a representation of the ligature in the defined codePage. If there is no representation, the ligature is expanded; otherwise, the ligature is not expanded.

 
noEncodingGuess  

When true, it disables guessing encoding of fonts that have unknown or custom encoding when there is no ToUnicode table. Inappropriate encoding conversions can cause the word finder to mistakenly recognize non-Roman single-byte fonts as Standard Roman encoding fonts and extract the text in an unusable format. When this option is selected, the word finder avoids such unreliable encoding conversions and tries to provide the original characters without any encoding conversion for a client with its own encoding handling. Use the PDWordGetCharEncFlags() method to detect such characters.

 
unknownToStdEnc  

When true, it assumes any font with unknown or custom encoding to be Standard Roman. This option overrides the noEncodingGuess option.

 
ignoreCharGaps  

When true, it disables converting large character gaps to space characters, so that the word finder reports a character space only when a space character appears in the original PDF content. This option has no effect on tagged PDF.

 
ignoreLineGaps  

When true, it disables treating vertical movements as line breaks, so that the word finder determines a line break only when a line break character or special tag information appears in the original PDF content. This option has no effect on tagged PDF.

 
noAnnots  

When true, it disables extracting text from text annotations. Normally, the word finder extracts text from the normal appearances of text annotations that are inside the page crop box.

 
noHyphenDetection  

When true, it disables finding and removing soft hyphens in non-tagged PDF, so that the word finder trusts hard hyphens as non-soft hyphens. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between soft and hard hyphen characters in non-tagged PDF files, because these are often misused.

 
trustNBSpace  

When true, it disables treating non-breaking space characters as regular space characters in non-tagged PDF files, so that the word finder preserves the space without breaking the word. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between breaking and non-breaking space characters in non-tagged PDF files, because these are often misused.

 
noExtCharOffset  

When true, it disables generating extended character offset information to improve text extraction performance. The extended character offset information is necessary to determine exact character offset for character-by-character text selection. The beginning character offset of each word is always available regardless of this option, and can be used for word-by-word text selection with reasonable accuracy. When a client has no need for the detailed character offset information, it can use this option to improve the text extraction efficiency. There is a minor difference in the text extraction performance, and less memory is needed for the extracted word list.

 
noStyleInfo  

When true, it disables generating character style information to improve text extraction performance and memory efficiency. When you select this option, you cannot use PDWordGetNthCharStyle() and PDWordGetStyleTransition() with the output of the word finder.

 
decomposeTbl  

A custom UTF-16 decomposition table. This table can be used to expand Unicode ligatures not included in the default ligature list. Each decomposition record contains a UTF-16 character code (either a single 16-bit code unit or a 32-bit surrogate pair), a replacement UTF-16 string, and the delimiter 0x0000.

 
decomposeTblSize  

The size of the decomposeTbl in bytes.

 
charTypeTbl  

A custom character type table to enhance word breaking quality. Each character type record contains a region start value, a region end value, and a character type flag as defined in PDExpT.h. Character codes are in UTF-16, and each is either a single 16-bit code unit or a 32-bit surrogate pair.

 
charTypeTblSize  

The size of the charTypeTbl in bytes.

 
preserveRedundantChars  

When true, it disables detecting and removing redundant characters. Some PDF pages have the same text drawn multiple times on the same spot to get a special visual effect. Normally, those redundant characters are removed from the word finder output.

Since this option may leave extra characters with overlapping bounding boxes, using it together with the disableCharReordering option is recommended for more consistent text extraction results.

 
disableCharReordering  

When true, it disables reconstructing the character order, and the word finding algorithm is applied to the characters in drawing order. By default, the word finder reorders characters on a single line by their relative horizontal locations. Most of the time, character reordering improves the text extraction quality. However, on a PDF page with heavily overlapping character bounding boxes, the outcome becomes somewhat unpredictable. In such cases, disabling character reordering (disableCharReordering = true) may produce a more stable result.
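
The following is a minimal sketch of how a client might prepare a configuration record along the lines described above. It is not the SDK definition: DemoWordFinderConfigRec is a hypothetical stand-in that mirrors only a few of the documented member names, with assumed member types, so that the example compiles without the SDK headers. The decomposition table contents (U+0132 and U+0133) are likewise chosen only for illustration. The record is zeroed first because the default configuration has all options false, and recSize is then set to the size of the record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PDWordFinderConfigRec. The member names come from
 * the element list above; the member types and the subset shown are ASSUMED.
 * The real definition lives in PDExpT.h. */
typedef struct {
    size_t          recSize;
    int             disableTaggedPDF;
    int             noXYSort;
    int             preserveSpaces;
    int             noStyleInfo;
    const uint16_t *decomposeTbl;
    size_t          decomposeTblSize;   /* size of decomposeTbl in bytes */
} DemoWordFinderConfigRec;

/* Each record: UTF-16 character code, replacement string, 0x0000 delimiter. */
static const uint16_t kDemoDecomposeTbl[] = {
    0x0132, 0x0049, 0x004A, 0x0000,   /* U+0132 (capital IJ ligature) -> "IJ" */
    0x0133, 0x0069, 0x006A, 0x0000    /* U+0133 (small ij ligature)   -> "ij" */
};

/* Fills in a configuration that keeps every option at its default (false)
 * except noStyleInfo, and attaches the custom decomposition table. */
static void demo_config_init(DemoWordFinderConfigRec *cfg)
{
    assert(cfg != NULL);
    memset(cfg, 0, sizeof *cfg);              /* default: all options false */
    cfg->recSize          = sizeof *cfg;      /* always the size of the record */
    cfg->noStyleInfo      = 1;                /* skip character style information */
    cfg->decomposeTbl     = kDemoDecomposeTbl;
    cfg->decomposeTblSize = sizeof kDemoDecomposeTbl;   /* bytes, not records */
}
```

In a plug-in, the equivalent record would be a PDWordFinderConfigRec passed to PDDocCreateWordFinderEx(); the exact parameter list of that call is not covered in this section.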

Callbacks Detail
PDWordFinderCtrlProc 
Product availability: All
Platform availability: All

Syntax

ASBool (*PDWordFinderCtrlProc)(ASUns32 startTime, void *clientData)

This is passed to PDWordFinderSetCtrlProc().

This is the callback function called by Word Finder when its page enumeration process takes longer than the specified time (in seconds). Return true to continue the enumeration process, or false to stop. startTime is the value that was set by ASGetSecs() when the Word Finder started processing the current page.
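
As an illustration of the return-value contract, the sketch below shows a control procedure that lets enumeration continue until a per-page time budget is used up. It is written against stand-in types (DemoASBool, DemoASUns32) so that it compiles on its own, and the DemoCtrlData structure and its nowSecs clock hook are hypothetical; in a plug-in, the clock role would presumably be played by ASGetSecs().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the SDK's ASBool and ASUns32 (ASSUMED widths). */
typedef uint16_t DemoASBool;
typedef uint32_t DemoASUns32;

/* Hypothetical client data: a time budget plus an injectable clock. */
typedef struct {
    DemoASUns32 maxSeconds;          /* stop once a page has taken longer than this */
    DemoASUns32 (*nowSecs)(void);    /* current time in seconds */
} DemoCtrlData;

/* Same shape as the documented callback:
 *   ASBool proc(ASUns32 startTime, void *clientData)
 * Returns true to continue the enumeration, false to stop it. */
static DemoASBool demo_ctrl_proc(DemoASUns32 startTime, void *clientData)
{
    const DemoCtrlData *ctl = (const DemoCtrlData *)clientData;
    assert(ctl != NULL && ctl->nowSecs != NULL);

    /* Unsigned subtraction stays correct even if the counter wraps around. */
    DemoASUns32 elapsed = ctl->nowSecs() - startTime;
    return (DemoASBool)(elapsed <= ctl->maxSeconds);
}

/* Fixed clock used only to exercise the sketch. */
static DemoASUns32 demo_fixed_clock(void)
{
    return 130u;
}
```

Injecting the clock through clientData keeps the callback free of global state and makes the stop condition easy to check in isolation.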


File: PDExpT.h
Line: 3814

Method Detail
PDWordFinderAcquireVisibleWordList()
Product availability: All
Platform availability: All

Syntax

void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)

Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.

The list contains only words that are visible in the given context. If the word states change in the given context, the word list will have to be released and re-acquired to reflect the changed set of visible words.

There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.

Use PDWordFinderEnumVisibleWords() instead of this method if you wish to find one visible word at a time instead of obtaining a table containing all visible words on a page.

This procedure is intended to replace the call to PDWordFinderAcquireWordList() in most cases where you want to work only with the content that is visible on screen (such as a text selection). Change this call to update an application to work with the Optional Content feature.

Parameters

wObj — 

The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list.

 
pgNum — 

The page number for which words are found. The first page is 0, not 1 as designated in Acrobat.

 
ocContext — 

The context within which the words are in a visible state. NULL is equivalent to passing PDDocGetOCContext(pdDoc).

 
wInfoP — 

(Filled by the method) A user-supplied PDWord variable. Acrobat will fill this in to point to an Acrobat-allocated array of PDWord objects, which should never be accessed directly.

Access the acquired list only through PDWordFinderGetNthWord(). The words are ordered in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS().

 
xySortTable — 

(Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the WXE_XY_SORT flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-NULL, the array is always filled regardless of the value of the rdFlags parameter in PDDocCreateWordFinder().

 
rdOrderTable — 

Currently unused. Pass NULL for its value.

 
numWords — 

(Filled by the method) The number of visible words found on the page.

Exceptions

pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00060000

File: PDProcs.h
Line: 10506
PDWordFinderAcquireWordList() 
Product availability: All
Platform availability: All

Syntax

void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)

Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.

Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.

There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.

Use PDWordFinderEnumWords() instead of this method, if you wish to find one word at a time instead of obtaining a table containing all words on a page.

Parameters

wObj — 

The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list.

 
pgNum — 

The page number for which words are found. The first page is 0, not 1 as designated in Acrobat.

 
wInfoP — 

(Filled by the method) A user-supplied PDWord variable. Acrobat will fill this in to point to an Acrobat-allocated array of PDWord objects, which should never be accessed directly.

Access the acquired list only through PDWordFinderGetNthWord(). The words are ordered in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS().

 
xySortTable — 

(Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the WXE_XY_SORT flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-NULL, the array is always filled regardless of the value of the rdFlags parameter in PDDocCreateWordFinder().

 
rdOrderTable — 

Currently unused. Pass NULL for this value.

 
numWords — 

(Filled by the method) The number of words found on the page.
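
The acquire, traverse, release sequence described above can be sketched as follows. To keep the example compilable without the SDK, the three PDWordFinder functions are replaced here by local stubs that follow the signatures documented in this section; the DemoWord and DemoFinder structures, the stub bodies, and demo_count_words() are all hypothetical. In a plug-in you would include the SDK headers instead of the stubs, and would also need to handle the exceptions listed for each method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (ASSUMED): the real PDWord and PDWordFinder are opaque. */
typedef int32_t ASInt32;
struct DemoWord   { const char *text; };
struct DemoFinder { struct DemoWord words[3]; int listHeld; };
typedef struct DemoWord   *PDWord;
typedef struct DemoFinder *PDWordFinder;

/* Stub: pretends every page holds the finder's three words. */
static void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum,
        PDWord *wInfoP, PDWord **xySortTable, PDWord **rdOrderTable,
        ASInt32 *numWords)
{
    (void)pgNum; (void)rdOrderTable;
    wObj->listHeld = 1;
    *wInfoP = &wObj->words[0];
    if (xySortTable != NULL) *xySortTable = NULL;  /* stub builds no XY table */
    *numWords = 3;
}

/* Stub: returns NULL past the end of the list, as documented. */
static PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh)
{
    return (nTh >= 0 && nTh < 3) ? &wObj->words[nTh] : NULL;
}

/* Stub: marks the single outstanding list as released. */
static void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum)
{
    (void)pgNum;
    wObj->listHeld = 0;
}

/* Client code: walks the PDF-order list and counts the words on one page. */
static ASInt32 demo_count_words(PDWordFinder wf, ASInt32 pgNum)
{
    PDWord  wordArray = NULL;   /* filled by the call; never indexed directly */
    ASInt32 numWords  = 0;
    ASInt32 seen      = 0;

    PDWordFinderAcquireWordList(wf, pgNum, &wordArray, NULL, NULL, &numWords);
    for (ASInt32 i = 0; i < numWords; i++) {
        PDWord w = PDWordFinderGetNthWord(wf, i);
        if (w == NULL) break;   /* end of list reached early */
        seen++;
    }
    /* Release before acquiring any other page: only one list may exist. */
    PDWordFinderReleaseWordList(wf, pgNum);
    return seen;
}
```

The release call is placed on the only exit path of the function so that the list is never left outstanding when the next page is acquired.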

Exceptions

pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h
Line: 4691
PDWordFinderDestroy() 
Product availability: All
Platform availability: All

Syntax

void PDWordFinderDestroy(PDWordFinder wObj)

Destroys a word finder. Use this when you are done extracting text in a file.

Parameters

wObj — 

IN/OUT The word finder to destroy.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h
Line: 4737
PDWordFinderEnumVisibleWords() 
Product availability: All
Platform availability: All

Syntax

ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext, PDWordProc wordProc, void* clientData)

Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.

Only words that are visible in the given optional-content context are enumerated.

Parameters

wObj — 

A word finder object.

 
PageNum — 

The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document.

 
ocContext — 

The context within which the words are in a visible state. NULL is equivalent to passing PDDocGetOCContext(pdDoc).

 
wordProc — 

A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false.

 
clientData — 

A pointer to user-supplied data to pass to wordProc each time it is called.

Returns

true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.

Exceptions

genErrBadParm is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document.
pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00060000

File: PDProcs.h
Line: 10578
PDWordFinderEnumWords() 
Product availability: All
Platform availability: All

Syntax

ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void* clientData)

Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.

Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.

Parameters

wObj — 

A word finder object.

 
PageNum — 

The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document.

 
wordProc — 

A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false.

 
clientData — 

A pointer to user-supplied data to pass to wordProc each time it is called.

Returns

true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.
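
The interaction between wordProc's return value and the method's own return value can be sketched as below. The PDWordProc parameter list is not given in this section, so the callback shape used here (word finder, word, page number, client data) is an assumption to be checked against PDExpT.h. PDWordFinderEnumWords() is replaced by a local stub over a fixed word array, and DemoLimit and demo_limit_proc() are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (ASSUMED): the real PDWord and PDWordFinder are opaque. */
typedef uint16_t ASBool;
typedef int32_t  ASInt32;
struct DemoWord   { const char *text; };
struct DemoFinder { struct DemoWord words[4]; ASInt32 count; };
typedef struct DemoWord   *PDWord;
typedef struct DemoFinder *PDWordFinder;

/* ASSUMED callback shape; verify against PDExpT.h before relying on it. */
typedef ASBool (*PDWordProc)(PDWordFinder wObj, PDWord wInfo,
                             ASInt32 pgNum, void *clientData);

/* Stub enumerator: calls wordProc once per word and halts when it returns
 * false, returning false in that case and true otherwise. */
static ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum,
                                    PDWordProc wordProc, void *clientData)
{
    for (ASInt32 i = 0; i < wObj->count; i++) {
        if (!wordProc(wObj, &wObj->words[i], PageNum, clientData))
            return 0;
    }
    return 1;
}

/* Hypothetical client data: stop after `limit` words have been seen. */
typedef struct { ASInt32 seen; ASInt32 limit; } DemoLimit;

static ASBool demo_limit_proc(PDWordFinder wObj, PDWord wInfo,
                              ASInt32 pgNum, void *clientData)
{
    DemoLimit *st = (DemoLimit *)clientData;
    (void)wObj; (void)wInfo; (void)pgNum;
    st->seen++;
    return (ASBool)(st->seen < st->limit);   /* false halts the enumeration */
}
```

A false return from the method therefore signals that the client asked to stop, not that an error occurred; errors are reported through the exceptions listed below.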

Exceptions

genErrBadParm is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document.
pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h
Line: 4773
PDWordFinderEnumWordsStr() 
Product availability: All
Platform availability: All

Syntax

ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val* ucsStr, ASUns32 strLen, ASUns32 charOffsetAdj, PDWordProc wordProc, void* clientData)

Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.

The words extracted by this method do not have quads, text style, or text selection information. The character offset is calculated from the beginning of the input string, and is increased by 2 on every 16 bits of data (the character offset of a character in a PDWord is the byte offset of the character in the source Unicode string).

Parameters

wObj — 

A word finder object.

 
ucsStr — 

A pointer to the Unicode string.

 
strLen — 

The length of the string in bytes.

 
charOffsetAdj — 

The character offset value of the first character in the input Unicode string. This value is added to the word character offsets, and is used to maintain contiguous word character offsets when multiple strings (and multiple calls to this method) are combined into one word list.

For example:

PDWordFinderEnumWordsStr(wf, str1, strlen(str1), 0, wp, d);

PDWordFinderEnumWordsStr(wf, str2, strlen(str2), strlen(str1), wp, d);

 
wordProc — 

A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false.

 
clientData — 

A pointer to user-supplied data to pass to wordProc each time it is called.

Returns

true if the enumeration was successfully completed, false if the enumeration was terminated because wordProc returned false.
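
Because strLen and the character offsets are expressed in bytes, a helper that measures a UTF-16 string in bytes is useful when chaining calls with charOffsetAdj. The sketch below is one way to do that; demo_utf16_byte_len() and the two sample strings are hypothetical, and the example assumes the strings are terminated by a 0x0000 code unit. The SDK call itself appears only in a comment, so the block compiles without the SDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte length of a 0x0000-terminated UTF-16 string, terminator excluded. */
static uint32_t demo_utf16_byte_len(const uint16_t *s)
{
    assert(s != NULL);
    uint32_t units = 0;
    while (s[units] != 0)
        units++;
    return units * (uint32_t)sizeof(uint16_t);   /* 2 bytes per code unit */
}

static const uint16_t kStr1[] = { 'H', 'e', 'l', 'l', 'o', ' ', 0 };  /* "Hello " */
static const uint16_t kStr2[] = { 'w', 'o', 'r', 'l', 'd', 0 };       /* "world"  */

/* In a plug-in, the two strings would then be fed to the word finder so that
 * the offsets of the second string continue where the first one ended:
 *
 *   len1 = demo_utf16_byte_len(kStr1);
 *   len2 = demo_utf16_byte_len(kStr2);
 *   PDWordFinderEnumWordsStr(wf, kStr1, len1, 0,    wp, d);
 *   PDWordFinderEnumWordsStr(wf, kStr2, len2, len1, wp, d);
 */
```

With these sample strings the first call covers byte offsets 0 through 11, so the second call starts at offset 12.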

Exceptions

genErrBadParm is raised if wordProc is NULL.
pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00060000

File: PDProcs.h
Line: 8561
PDWordFinderGetLatestAlgVersion() 
Product availability: All
Platform availability: All

Syntax

ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj)

Gets the version number of the specified word finder, or the version number of the latest word finder algorithm.

Parameters

wObj — 

IN/OUT The word finder whose algorithm's version is obtained. Pass NULL to obtain the latest word finding algorithm version number.

Returns

The algorithm version associated with wObj, or the version of the latest word finder algorithm if wObj is NULL.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h
Line: 4708
PDWordFinderGetNthWord() 
Product availability: All
Platform availability: All

Syntax

PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh)

Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().

Parameters

wObj — 

IN/OUT The word finder whose nth word is obtained.

 
nTh — 

IN/OUT The index of the word to obtain. The first word on a page has an index of zero. Words are counted in PDF order. See the description of the wInfoP parameter in PDWordFinderAcquireWordList().

Returns

The nth word. It returns NULL when the end of the list is reached.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h
Line: 2192
PDWordFinderReleaseWordList() 
Product availability: All
Platform availability: All

Syntax

void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum)

Releases the word list for a given page. Use this to release a list created by PDWordFinderAcquireWordList() when you are done using this list.

Parameters

wObj — 

A word finder object.

 
pgNum — 

The page number for which the word list is released.

Exceptions

genErrBadUnlock is raised if the list has already been released.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h
Line: 4725