PDWordFinder - Acrobat and PDF Library API Reference

	Define
	WF_LATEST_VERSION Used to obtain the latest available version.

	Typedef
	PDWordFinder Extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document.
	PDWordFinderConfig
	PDWordFinderConfigRec

	Structure
	_t_PDWordFinderConfig A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.

	Callback
	PDWordFinderCtrlProc This is passed to PDWordFinderSetCtrlProc().

	Method
	void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords) Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
	void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords) Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
	void PDWordFinderDestroy(PDWordFinder wObj) Destroys a word finder. Use this when you are done extracting text in a file.
	ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext, PDWordProc wordProc, void* clientData) Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
	ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void* clientData) Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
	ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val* ucsStr, ASUns32 strLen, ASUns32 charOffsetAdj, PDWordProc wordProc, void* clientData) Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.
	ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj) Gets the version number of the specified word finder, or the version number of the latest word finder algorithm.
	PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh) Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().
	void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum) Releases the word list for a given page. Use this to release a list created by PDWordFinderAcquireWordList() when you are done using this list.

Defines Detail

WF_LATEST_VERSION

Product availability: All

Platform availability: All

Syntax


#define WF_LATEST_VERSION 0

Description

Used to obtain the latest available version.

File: PDExpT.h

Line: 3562

Typedefs Detail

PDWordFinder

Product availability: All

Platform availability: All

Syntax

typedef struct _t_PDWordFinder* PDWordFinder;

Extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document.

File: PDExpT.h

Line: 3294

PDWordFinderConfig

Product availability: All

Platform availability: All

Syntax

typedef struct _t_PDWordFinderConfig* PDWordFinderConfig;

File: PDExpT.h

Line: 3806

PDWordFinderConfigRec

Product availability: All

Platform availability: All

Syntax

typedef struct _t_PDWordFinderConfig PDWordFinderConfigRec;

File: PDExpT.h

Line: 3806

Structure Detail

_t_PDWordFinderConfig

Product availability: All

Platform availability: All

Syntax

struct _t_PDWordFinderConfig {
  ASSize_t recSize;
  ASBool disableTaggedPDF;
  ASBool noXYSort;
  ASBool preserveSpaces;
  ASBool noLigatureExp;
  ASBool noEncodingGuess;
  ASBool unknownToStdEnc;
  ASBool ignoreCharGaps;
  ASBool ignoreLineGaps;
  ASBool noAnnots;
  ASBool noHyphenDetection;
  ASBool trustNBSpace;
  ASBool noExtCharOffset;
  ASBool noStyleInfo;
  ASUns16 decomposeTbl;
  ASSize_t decomposeTblSize;
  ASUns16 charTypeTbl;
  ASSize_t charTypeTblSize;
  ASBool preserveRedundantChars;
  ASBool disableCharReordering;
};

A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.
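
Because every option defaults to false and `recSize` must hold the size of the record, one common way to prepare a configuration is to zero the whole record and then set only the fields that differ from the default. The following is a minimal sketch of that pattern, not SDK code. `DemoConfigRec` and `DemoConfigInit` are hypothetical names, the record is a trimmed stand-in carrying only three of the documented options so that the example compiles without PDExpT.h, and the `ASBool` and `ASSize_t` typedefs are placeholders for the real SDK types.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder typedefs; the real definitions come from the SDK headers. */
typedef unsigned short ASBool;
typedef size_t ASSize_t;

/* Hypothetical, trimmed stand-in for PDWordFinderConfigRec. */
typedef struct {
    ASSize_t recSize;
    ASBool   noXYSort;
    ASBool   preserveSpaces;
    ASBool   noStyleInfo;
} DemoConfigRec;

/* Zero every option (the documented default), then record the size. */
static void DemoConfigInit(DemoConfigRec *cfg)
{
    memset(cfg, 0, sizeof *cfg);
    cfg->recSize = sizeof *cfg;
}
```

After `DemoConfigInit`, a caller would switch on individual options, for example `cfg.preserveSpaces = 1;`. In a plug-in the same steps would presumably be applied to a `PDWordFinderConfigRec` before passing it to PDDocCreateWordFinderEx().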

See Also

PDDocCreateWordFinderEx

File: PDExpT.h

Line: 3607

Elements

	recSize	This is always `sizeof(PDWordFinderConfigRec)`.

	disableTaggedPDF	When `true`, it disables tagged PDF support and treats the document as non-tagged PDF. Use this to keep the word finder in legacy mode when it is created with the latest algorithm version (WF_LATEST_VERSION).

	noXYSort	When `true`, it disables generating an XY-ordered word list. This option replaces the sort order flags in the older version of the word finder creation command (PDDocCreateWordFinder()). Setting this option is equivalent to omitting the `WXE_XY_SORT` flag.

	preserveSpaces	When `true`, the word finder preserves space characters during word breaking. Otherwise, spaces are removed from output text. When `false` (the default), you can add spaces later by considering the word attribute flag `WXE_ADJACENT_TO_SPACE`, but there is no way to restore the exact number of consecutive space characters.

	noLigatureExp	When `true`, and the font has a ToUnicode table, it disables the expansion of ligatures using the default ligatures. The default ligatures are: fi, ff, fl, ffi, ffl, st, oe, OE. When `noLigatureExp` is `true` and the font does not have a ToUnicode table, the ligature is expanded based on whether there is a representation of the ligature in the defined `codePage`. If there is no representation, the ligature is expanded; otherwise, the ligature is not expanded.

	noEncodingGuess	When `true`, it disables guessing encoding of fonts that have unknown or custom encoding when there is no ToUnicode table. Inappropriate encoding conversions can cause the word finder to mistakenly recognize non-Roman single-byte fonts as Standard Roman encoding fonts and extract the text in an unusable format. When this option is selected, the word finder avoids such unreliable encoding conversions and tries to provide the original characters without any encoding conversion for a client with its own encoding handling. Use the PDWordGetCharEncFlags() method to detect such characters.

	unknownToStdEnc	When `true`, it assumes any font with unknown or custom encoding to be Standard Roman. This option overrides the `noEncodingGuess` option.

	ignoreCharGaps	When `true`, it disables converting large character gaps to space characters, so that the word finder reports a character space only when a space character appears in the original PDF content. This option has no effect on tagged PDF.

	ignoreLineGaps	When `true`, it disables treating vertical movements as line breaks, so that the word finder determines a line break only when a line break character or special tag information appears in the original PDF content. This option has no effect on tagged PDF.

	noAnnots	When `true`, it disables extracting text from text annotations. Normally, the word finder extracts text from the normal appearances of text annotations that are inside the page crop box.

	noHyphenDetection	When `true`, it disables finding and removing soft hyphens in non-tagged PDF, so that the word finder trusts hard hyphens as non-soft hyphens. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between soft and hard hyphen characters in non-tagged PDF files, because these are often misused.

	trustNBSpace	When `true`, it disables treating non-breaking space characters as regular space characters in non-tagged PDF files, so that the word finder preserves the space without breaking the word. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between breaking and non-breaking space characters in non-tagged PDF files, because these are often misused.

	noExtCharOffset	When `true`, it disables generating extended character offset information to improve text extraction performance. The extended character offset information is necessary to determine exact character offset for character-by-character text selection. The beginning character offset of each word is always available regardless of this option, and can be used for word-by-word text selection with reasonable accuracy. When a client has no need for the detailed character offset information, it can use this option to improve the text extraction efficiency. There is a minor difference in the text extraction performance, and less memory is needed for the extracted word list.

	noStyleInfo	When `true`, it disables generating character style information to improve text extraction performance and memory efficiency. When you select this option, you cannot use PDWordGetNthCharStyle() and PDWordGetStyleTransition() with the output of the word finder.

	decomposeTbl	A custom UTF-16 decomposition table. This table can be used to expand Unicode ligatures not included in the default ligature list. Each decomposition record contains a UTF-16 character code (either a single 16-bit unit or a 32-bit surrogate pair), a replacement UTF-16 string, and the delimiter `0x0000`.

	decomposeTblSize	The size of the `decomposeTbl` in bytes.

	charTypeTbl	A custom character type table to enhance word breaking quality. Each character type record contains a region start value, a region end value, and a character type flag as defined in PDExpT.h. A character code is in UTF-16, and is either a single 16-bit unit or a 32-bit surrogate pair.

	charTypeTblSize	The size of the `charTypeTbl` in bytes.

	preserveRedundantChars	When `true`, it disables detecting and removing redundant characters. Some PDF pages have the same text drawn multiple times on the same spot to get a special visual effect. Normally, those redundant characters are removed from the word finder output. Since this option may leave extra characters with overlapping bounding boxes, using it together with the `disableCharReordering` option is recommended for more consistent text extraction results.

	disableCharReordering	When `true`, it disables reconstructing the character order, and the word finding algorithm is applied to the characters in drawing order. By default, the word finder reorders characters on a single line by their relative horizontal locations. Most of the time, character reordering improves the text extraction quality. However, on a PDF page with heavily overlapped character bounding boxes, the outcome becomes somewhat unpredictable. In such cases, disabling character reordering (`disableCharReordering = true`) may produce a more stable result.
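
The record layout described for `decomposeTbl` (a character code, a replacement string, then the delimiter `0x0000`) can be illustrated with a small table builder. This is a sketch under stated assumptions: `AppendDecompRecord` and `DecompTableBytes` are hypothetical helpers, `uint16_t` stands in for `ASUns16`, and details the description above does not state (such as byte order or whether the table needs a final terminator) are left open.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: appends one record (character code, replacement
 * string, 0x0000 delimiter) to a table of 16-bit units.
 * Returns the new number of units used, or 0 if the record does not fit.
 */
static size_t AppendDecompRecord(uint16_t *tbl, size_t capUnits, size_t usedUnits,
                                 const uint16_t *code, size_t codeUnits,
                                 const uint16_t *repl, size_t replUnits)
{
    size_t need = codeUnits + replUnits + 1;   /* +1 for the delimiter */
    size_t i;

    if (usedUnits > capUnits || need > capUnits - usedUnits)
        return 0;

    for (i = 0; i < codeUnits; i++)
        tbl[usedUnits++] = code[i];
    for (i = 0; i < replUnits; i++)
        tbl[usedUnits++] = repl[i];
    tbl[usedUnits++] = 0x0000;
    return usedUnits;
}

/* Size in bytes, the unit that decomposeTblSize is documented to use. */
static size_t DecompTableBytes(size_t usedUnits)
{
    return usedUnits * sizeof(uint16_t);
}
```

For example, a record that maps U+0132 (the IJ ligature) to the two letters `I` and `J` takes four 16-bit units: one for the code, two for the replacement, and one for the delimiter. `DecompTableBytes` then reports 8 bytes for `decomposeTblSize`.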

Callbacks Detail

PDWordFinderCtrlProc

Product availability: All

Platform availability: All

Syntax

ASBool (*PDWordFinderCtrlProc)(ASUns32 startTime, void *clientData)

This is passed to PDWordFinderSetCtrlProc().

This is the callback function called by Word Finder when its page enumeration process takes longer than the specified time (in seconds). Return true to continue the enumeration process, or false to stop. startTime is the value that was set by ASGetSecs() when the Word Finder started processing the current page.
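
As an illustration of the continue-or-stop decision, the sketch below gives up on a page once a caller-chosen number of seconds has passed since `startTime`. It is not SDK code: `DemoWithinBudget` and `DemoCtrlProc` are hypothetical names, the typedefs are placeholders for the SDK types, and `time(NULL)` stands in for ASGetSecs() so that the example compiles without the SDK headers. A real callback would call ASGetSecs() so that both timestamps come from the same clock.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Placeholder typedefs; the real definitions come from the SDK headers. */
typedef unsigned short ASBool;
typedef uint32_t       ASUns32;

/* Pure decision: is the elapsed time still within the budget? Unsigned
 * subtraction keeps the result correct if the seconds counter wraps. */
static ASBool DemoWithinBudget(ASUns32 startTime, ASUns32 now, ASUns32 maxSecs)
{
    return (ASBool)((ASUns32)(now - startTime) <= maxSecs);
}

/* Has the PDWordFinderCtrlProc shape. clientData points to an ASUns32
 * holding the maximum number of seconds to spend on one page. */
static ASBool DemoCtrlProc(ASUns32 startTime, void *clientData)
{
    const ASUns32 *maxSecs = (const ASUns32 *)clientData;
    ASUns32 now = (ASUns32)time(NULL);   /* stand-in for ASGetSecs() */

    if (maxSecs == NULL)
        return 1;                        /* no budget supplied: continue */
    return DemoWithinBudget(startTime, now, *maxSecs);
}
```

Keeping the comparison in a separate pure function makes the policy easy to check in isolation, while the callback itself only reads the clock and the client data.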

File: PDExpT.h

Line: 3814

Method Detail

PDWordFinderAcquireVisibleWordList()

Product availability: All

Platform availability: All

Syntax





void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)

Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.

The list contains only words that are visible in the given context. If the word states change in the given context, the word list will have to be released and re-acquired to reflect the changed set of visible words.

There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.

Use PDWordFinderEnumWords() instead of this method if you wish to find one word at a time instead of obtaining a table containing all visible words on a page.

This procedure is intended to replace the call to PDWordFinderAcquireWordList() in most cases where you want to work only with the content that is visible on screen (such as a text selection). Change this call to update an application to work with the Optional Content feature.

Parameters

	`wObj` —	The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list.

	`pgNum` —	The page number for which words are found. The first page is `0`, not `1` as designated in Acrobat.

	`ocContext` —	The context within which the words are in a visible state. `NULL` is equivalent to passing `PDDocGetOCContext(pdDoc)`.

	`wInfoP` —	(Filled by the method) A user-supplied PDWord variable. Acrobat fills this in to point to an Acrobat-allocated array of PDWord objects. Do not access this array directly; traverse it with PDWordFinderGetNthWord(). The words are in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS().

	`xySortTable` —	(Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the `WXE_XY_SORT` flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-`NULL`, the array is always filled regardless of the value of the `rdFlags` parameter in PDDocCreateWordFinder().

	`rdOrderTable` —	Currently unused. Pass `NULL` for its value.

	`numWords` —	(Filled by the method) The number of visible words found on the page.

Exceptions

pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00060000

File: PDProcs.h

Line: 10506

PDWordFinderAcquireWordList()

Product availability: All

Platform availability: All

Syntax





void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)

Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.

Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.

There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.

Use PDWordFinderEnumWords() instead of this method, if you wish to find one word at a time instead of obtaining a table containing all words on a page.

Parameters

	`wObj` —	The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list.

	`pgNum` —	The page number for which words are found. The first page is `0`, not `1` as designated in Acrobat.

	`wInfoP` —	(Filled by the method) A user-supplied PDWord variable. Acrobat fills this in to point to an Acrobat-allocated array of PDWord objects. Do not access this array directly; traverse it with PDWordFinderGetNthWord(). The words are in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS().

	`xySortTable` —	(Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the `WXE_XY_SORT` flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-`NULL`, the array is always filled regardless of the value of the `rdFlags` parameter in PDDocCreateWordFinder().

	`rdOrderTable` —	Currently unused. Pass `NULL` for this value.

	`numWords` —	(Filled by the method) The number of words found on the page.

Exceptions

pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h

Line: 4691

PDWordFinderDestroy()

Product availability: All

Platform availability: All

Syntax





void PDWordFinderDestroy(PDWordFinder wObj)

Destroys a word finder. Use this when you are done extracting text in a file.

Parameters

	`wObj` —	IN/OUT The word finder to destroy.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h

Line: 4737

PDWordFinderEnumVisibleWords()

Product availability: All

Platform availability: All

Syntax





ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext, PDWordProc wordProc, void* clientData)

Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.

Only words that are visible in the given optional-content context are enumerated.

Parameters

	`wObj` —	A word finder object.

	`PageNum` —	The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document.

	`ocContext` —	The context within which the words are in a visible state. `NULL` is equivalent to passing `PDDocGetOCContext(pdDoc)`.

	`wordProc` —	A user-supplied callback to call once for each word found. Enumeration halts if `wordProc` returns `false`.

	`clientData` —	A pointer to user-supplied data to pass to `wordProc` each time it is called.

Returns

true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.

Exceptions

genErrBadParm is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document.
pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00060000

File: PDProcs.h

Line: 10578

PDWordFinderEnumWords()

Product availability: All

Platform availability: All

Syntax





ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void* clientData)

Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.

Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.

Parameters

	`wObj` —	A word finder object.

	`PageNum` —	The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document.

	`wordProc` —	A user-supplied callback to call once for each word found. Enumeration halts if `wordProc` returns `false`.

	`clientData` —	A pointer to user-supplied data to pass to `wordProc` each time it is called.

Returns

true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.

Exceptions

genErrBadParm is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document.
pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h

Line: 4773

PDWordFinderEnumWordsStr()

Product availability: All

Platform availability: All

Syntax





ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val* ucsStr, ASUns32 strLen, ASUns32 charOffsetAdj, PDWordProc wordProc, void* clientData)

Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.

The words extracted by this method do not have quads, text style, or text selection information. The character offset is calculated from the beginning of the input string, and is increased by 2 on every 16 bits of data (the character offset of a character in a PDWord is the byte offset of the character in the source Unicode string).

Parameters

	`wObj` —	A word finder object.

	`ucsStr` —	A pointer to the Unicode string.

	`strLen` —	The length of the string in bytes.

	`charOffsetAdj` —	The character offset value of the first character in the input Unicode string. This value is added to the word character offsets, and is used to maintain contiguous word character offsets when multiple strings (and multiple calls to this method) are combined into one word list. For example, where `len1` and `len2` are the lengths in bytes of `str1` and `str2`: `PDWordFinderEnumWordsStr(wf, str1, len1, 0, wp, d);` `PDWordFinderEnumWordsStr(wf, str2, len2, len1, wp, d);`

	`wordProc` —	A user-supplied callback to call once for each word found. Enumeration halts if `wordProc` returns `false`.

	`clientData` —	A pointer to user-supplied data to pass to `wordProc` each time it is called.
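
Because `strLen` is given in bytes and character offsets are byte offsets into the source string, the values passed for `strLen` and `charOffsetAdj` across several calls can be derived as in the sketch below. The helper names are hypothetical, `uint16_t` stands in for `ASUTF16Val`, and the strings are assumed to be terminated by a `0x0000` unit, which this reference does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte length of a 0x0000-terminated UTF-16 string (terminator excluded). */
static uint32_t DemoUtf16ByteLen(const uint16_t *s)
{
    uint32_t units = 0;
    while (s[units] != 0)
        units++;
    return units * (uint32_t)sizeof(uint16_t);
}

/* Offset adjustment for the next string: the previous adjustment plus
 * the byte length of the string that was just enumerated. */
static uint32_t DemoNextOffsetAdj(uint32_t prevAdj, uint32_t prevByteLen)
{
    return prevAdj + prevByteLen;
}
```

For a first string of three UTF-16 units, `DemoUtf16ByteLen` gives 6, so the second call would pass 6 as `charOffsetAdj`, and a third call would pass 6 plus the byte length of the second string.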

Returns

true if the enumeration was successfully completed, false if the enumeration was terminated because wordProc returned false.

Exceptions

genErrBadParm is raised if wordProc is NULL.
pdErrOpNotPermitted

Since

PI_PDMODEL_VERSION >= 0x00060000

File: PDProcs.h

Line: 8561

PDWordFinderGetLatestAlgVersion()

Product availability: All

Platform availability: All

Syntax





ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj)

Gets the version number of the specified word finder, or the version number of the latest word finder algorithm.

Parameters

	`wObj` —	IN/OUT The word finder whose algorithm's version is obtained. Pass `NULL` to obtain the latest word finding algorithm version number.

Returns

The algorithm version associated with wObj, or the version of the latest word finder algorithm if wObj is NULL.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h

Line: 4708

PDWordFinderGetNthWord()

Product availability: All

Platform availability: All

Syntax





PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh)

Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().

Parameters

	`wObj` —	IN/OUT The word finder whose nth word is obtained.

	`nTh` —	IN/OUT The index of the word to obtain. The first word on a page has an index of zero. Words are counted in PDF order. See the description of the `wInfoP` parameter in PDWordFinderAcquireWordList().

Returns

The nth word. It returns NULL when the end of the list is reached.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h

Line: 2192

PDWordFinderReleaseWordList()

Product availability: All

Platform availability: All

Syntax





void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum)

Releases the word list for a given page. Use this to release a list created by PDWordFinderAcquireWordList() when you are done using this list.

Parameters

	`wObj` —	A word finder object.

	`pgNum` —	The page number for which the word list is released.

Exceptions

genErrBadUnlock is raised if the list has already been released.

Since

PI_PDMODEL_VERSION >= 0x00020000

File: PDProcs.h

Line: 4725


Layer	PD_Layer
Object	PDWordFinder
