Layer | PD_Layer |
Object | PDWordFinder |
A PDWordFinder extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document. The core API provides methods to extract words from a document, obtain information on the word finder, and to release a list of words after a plug-in is done using it.
To create a word finder, use PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDDocCreateWordFinderEx() is a version 6.0 replacement for PDDocCreateWordFinder() and PDDocCreateWordFinderUCS() that adds configurable word-breaking behavior.
There are two primary methods of using word finders:
Using PDWordFinderAcquireWordList(), which builds a word list for an entire page before it returns. This method can return the recognized words in two possible orders:
Define | ||
---|---|---|
WF_LATEST_VERSION
Used to obtain the latest available version.
|
Typedef | ||
---|---|---|
PDWordFinder
Extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document.
|
||
PDWordFinderConfig | ||
PDWordFinderConfigRec |
Structure | ||
---|---|---|
_t_PDWordFinderConfig
A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.
|
Callback | ||
---|---|---|
PDWordFinderCtrlProc
This is passed to PDWordFinderSetCtrlProc().
|
Method | ||
---|---|---|
void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)
Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
|
||
void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)
Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
|
||
void PDWordFinderDestroy(PDWordFinder wObj)
Destroys a word finder. Use this when you are done extracting text in a file.
|
||
ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext, PDWordProc wordProc, void* clientData)
Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
|
||
ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void* clientData)
Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
|
||
ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val* ucsStr, ASUns32 strLen, ASUns32 charOffsetAdj, PDWordProc wordProc, void* clientData)
Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.
|
||
ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj)
Gets the version number of the specified word finder, or the version number of the latest word finder algorithm.
|
||
PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh)
Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().
|
||
void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum)
Releases the word list for a given page. Use this to release a list created by PDWordFinderAcquireWordList() when you are done using this list.
|
WF_LATEST_VERSION |
Product availability: All |
Platform availability: All |
#define WF_LATEST_VERSION 0
Used to obtain the latest available version.
File: PDExpT.h |
Line: 3562 |
PDWordFinder |
Product availability: All |
Platform availability: All |
typedef struct _t_PDWordFinder* PDWordFinder;
Extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document.
See Also
File: PDExpT.h |
Line: 3294 |
PDWordFinderConfig |
Product availability: All |
Platform availability: All |
typedef _t_PDWordFinderConfig PDWordFinderConfig;
File: PDExpT.h |
Line: 3806 |
PDWordFinderConfigRec |
Product availability: All |
Platform availability: All |
typedef _t_PDWordFinderConfig PDWordFinderConfigRec;
File: PDExpT.h |
Line: 3806 |
_t_PDWordFinderConfig |
Product availability: All |
Platform availability: All |
struct _t_PDWordFinderConfig {
A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.
See Also
File: PDExpT.h |
Line: 3607 |
recSize | This is always |
|
disableTaggedPDF | When |
|
noXYSort | When |
|
preserveSpaces | When |
|
noLigatureExp | When |
|
noEncodingGuess | When |
|
unknownToStdEnc | When |
|
ignoreCharGaps | When |
|
ignoreLineGaps | When |
|
noAnnots | When |
|
noHyphenDetection | When |
|
trustNBSpace | When |
|
noExtCharOffset | When |
|
noStyleInfo | When |
|
decomposeTbl | A custom UTF-16 decomposition table. This table can be used to expand Unicode ligatures not included in the default ligature list. Each decomposition record contains a UTF-16 character code (either a 16-bit or 32-bit surrogate), a replacement UTF-16 string, and the delimiter |
|
decomposeTblSize | The size of the |
|
charTypeTbl | A custom character type table to enhance word breaking quality. Each character type record contains a region start value, a region end value, and a character type flag as defined in PDExpT.h. A character code is in UTF-16, and is either a 16-bit or a 32-bit surrogate. |
|
charTypeTblSize | The size of the |
|
preserveRedundantChars | When Since this option may leave extra characters with overlapping bounding boxes, using it together with the |
|
disableCharReordering | When |
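The charTypeTbl description above can be pictured as a range lookup. The following is a minimal sketch, not the SDK's data layout: the record struct, the flag values, and the function names are hypothetical, and the real flag constants are the ones defined in PDExpT.h. It assumes the table is scanned linearly and that a 32-bit entry is a surrogate pair that is first combined into one code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; the real flags live in PDExpT.h. */
enum { CT_UNKNOWN = 0, CT_LETTER = 1, CT_PUNCT = 2 };

/* Hypothetical record: an inclusive code point range and its type flag. */
typedef struct {
    uint32_t start;
    uint32_t end;
    int      type;
} CharTypeRange;

/* Combine a UTF-16 surrogate pair into a single code point. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10) + ((uint32_t)lo - 0xDC00u);
}

/* Return the flag of the first range containing cp, or CT_UNKNOWN. */
int lookup_char_type(const CharTypeRange *tbl, size_t n, uint32_t cp)
{
    for (size_t i = 0; i < n; i++) {
        if (cp >= tbl[i].start && cp <= tbl[i].end)
            return tbl[i].type;
    }
    return CT_UNKNOWN;
}
```

A character that falls in no region gets CT_UNKNOWN in this sketch; how the word finder itself treats unlisted characters is not stated here.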
PDWordFinderCtrlProc |
Product availability: All |
Platform availability: All |
This is passed to PDWordFinderSetCtrlProc().
This is the callback function called by Word Finder when its page enumeration process takes longer than the specified time (in seconds). Return true to continue the enumeration process, or false to stop. startTime is the value that was set by ASGetSecs() when the Word Finder started processing the current page.
File: PDExpT.h |
Line: 3814 |
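The continue-or-stop decision described for this callback could be sketched as follows. This is an illustration under assumptions, not the SDK's declaration: the two-argument callback shape, the CtrlBudget struct, and its field names are hypothetical, and the current time is passed in through client data (rather than read from ASGetSecs()) so that the sketch compiles without the Acrobat headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical client data: the current time and a hard per-page budget. */
typedef struct {
    int32_t now_secs;      /* stand-in for a fresh ASGetSecs() reading */
    int32_t max_page_secs; /* give up on a page after this many seconds */
} CtrlBudget;

/* Return true to continue enumerating the page, false to stop. */
bool continue_page_enum(int32_t start_time, void *client_data)
{
    const CtrlBudget *b = (const CtrlBudget *)client_data;
    int32_t elapsed = b->now_secs - start_time;
    return elapsed < b->max_page_secs;
}
```

Measuring from startTime rather than from the moment the callback fires keeps the budget tied to the whole page, which matches how startTime is defined above.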
PDWordFinderAcquireVisibleWordList | () |
Product availability: All |
Platform availability: All |
void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)
Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
The list contains only words that are visible in the given context. If the word states change in the given context, the word list will have to be released and re-acquired to reflect the changed set of visible words.
There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.
Use PDWordFinderEnumWords() instead of this method if you wish to find one word at a time instead of obtaining a table containing all visible words on a page.
This procedure is intended to replace the call to PDWordFinderAcquireWordList() in most cases where you want to work only with the content that is visible on screen (such as a text selection). Change this call to update an application to work with the Optional Content feature.
Parameters
wObj — | The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list. |
|
pgNum — | The page number for which words are found. First page is |
|
ocContext — | The context within which the words are in a visible state. |
|
wInfoP — | (Filled by the method) A user-supplied PDWord variable. Acrobat fills this in to point to an Acrobat-allocated array of PDWord objects. Do not access this array directly; traverse it with PDWordFinderGetNthWord(). The words are in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). |
|
xySortTable — | (Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the WXE_XY_SORT flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release the memory allocated for this table; otherwise the memory leaks. As long as this parameter is non- |
|
rdOrderTable — | Currently unused. Pass |
|
numWords — | (Filled by the method) The number of visible words found on the page. |
See Also
Exceptions
Since
File: PDProcs.h |
Line: 10506 |
PDWordFinderAcquireWordList | () |
Product availability: All |
Platform availability: All |
void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord* wInfoP, PDWord** xySortTable, PDWord** rdOrderTable, ASInt32* numWords)
Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.
There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.
Use PDWordFinderEnumWords() instead of this method if you wish to find one word at a time instead of obtaining a table containing all words on a page.
Parameters
wObj — | The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list. |
|
pgNum — | The page number for which words are found. The first page is |
|
wInfoP — | (Filled by the method) A user-supplied PDWord variable. Acrobat fills this in to point to an Acrobat-allocated array of PDWord objects. Do not access this array directly; traverse it with PDWordFinderGetNthWord(). The words are in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). |
|
xySortTable — | (Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the |
|
rdOrderTable — | Currently unused. Pass |
|
numWords — | (Filled by the method) The number of words found on the page. |
See Also
Exceptions
Since
File: PDProcs.h |
Line: 4691 |
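To make the x-y order concrete, here is a small sketch that sorts word boxes the way the xySortTable description reads: line by line from the top of the page, left to right within a line. It is not Acrobat's algorithm. The WordBox struct and the function names are hypothetical, each word is assumed to carry a precomputed line baseline, and y is assumed to grow upward as in default PDF user space.

```c
#include <stdlib.h>

/* Hypothetical word record: left edge and baseline in page coordinates. */
typedef struct {
    double left;
    double baseline;
} WordBox;

/* Higher baseline first (top of page), then smaller left edge first. */
static int compare_xy(const void *a, const void *b)
{
    const WordBox *wa = (const WordBox *)a;
    const WordBox *wb = (const WordBox *)b;
    if (wa->baseline != wb->baseline)
        return (wa->baseline > wb->baseline) ? -1 : 1;
    if (wa->left != wb->left)
        return (wa->left < wb->left) ? -1 : 1;
    return 0;
}

/* Sort an array of word boxes into x-y order in place. */
void sort_words_xy(WordBox *words, size_t count)
{
    qsort(words, count, sizeof words[0], compare_xy);
}
```

This ordering can differ from PDF order whenever the file draws text out of reading sequence, which is why the two tables are described separately.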
PDWordFinderDestroy | () |
Product availability: All |
Platform availability: All |
void PDWordFinderDestroy(PDWordFinder wObj)
Destroys a word finder. Use this when you are done extracting text in a file.
Parameters
wObj — | IN/OUT The word finder to destroy. |
See Also
Since
File: PDProcs.h |
Line: 4737 |
PDWordFinderEnumVisibleWords | () |
Product availability: All |
Platform availability: All |
ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext, PDWordProc wordProc, void* clientData)
Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
Only words that are visible in the given optional-content context are enumerated.
Parameters
wObj — | A word finder object. |
|
PageNum — | The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document. |
|
ocContext — | The context within which the words are in a visible state. |
|
wordProc — | A user-supplied callback to call once for each word found. Enumeration halts if |
|
clientData — | A pointer to user-supplied data to pass to |
|
See Also
Exceptions
wordProc is NULL, or pageNum is less than zero or greater than the total number of pages in the document.
Since
File: PDProcs.h |
Line: 10578 |
PDWordFinderEnumWords | () |
Product availability: All |
Platform availability: All |
ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void* clientData)
Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.
Parameters
wObj — | A word finder object. |
|
PageNum — | The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document. |
|
wordProc — | A user-supplied callback to call once for each word found. Enumeration halts if |
|
clientData — | A pointer to user-supplied data to pass to |
|
See Also
Exceptions
wordProc is NULL, or pageNum is less than zero or greater than the total number of pages in the document.
Since
File: PDProcs.h |
Line: 4773 |
PDWordFinderEnumWordsStr | () |
Product availability: All |
Platform availability: All |
ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val* ucsStr, ASUns32 strLen, ASUns32 charOffsetAdj, PDWordProc wordProc, void* clientData)
Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.
The words extracted by this method do not have quads, text style, or text selection information. The character offset is calculated from the beginning of the input string, and is increased by 2 on every 16 bits of data (the character offset of a character in a PDWord is the byte offset of the character in the source Unicode string).
Parameters
wObj — | A word finder object. |
|
ucsStr — | A pointer to the Unicode string. |
|
strLen — | The length of the string in bytes. |
|
charOffsetAdj — | The character offset value of the first character in the input Unicode string. This value is added to the word character offsets, and is used to maintain contiguous word character offsets when multiple strings (and multiple calls to this method) are combined into one word list. For example:
|
|
wordProc — | A user-supplied callback to call once for each word found. Enumeration halts if |
|
clientData — | A pointer to user-supplied data to pass to |
|
See Also
Exceptions
Since
File: PDProcs.h |
Line: 8561 |
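The offset arithmetic described for PDWordFinderEnumWordsStr() can be sketched as below. The helper names are hypothetical and nothing here calls the SDK. The sketch assumes that the charOffsetAdj for a follow-up string is the previous adjustment plus the previous string's length in bytes, which is one way to keep offsets contiguous across calls.

```c
#include <stdint.h>

/* Byte offset of the UTF-16 code unit at unit_index: 2 bytes per 16 bits,
   shifted by the adjustment supplied for this string. */
uint32_t char_offset(uint32_t unit_index, uint32_t char_offset_adj)
{
    return char_offset_adj + 2u * unit_index;
}

/* Adjustment to pass for the next string so that offsets stay contiguous
   (assumption: the next string starts right after this one). */
uint32_t next_offset_adj(uint32_t char_offset_adj, uint32_t str_len_bytes)
{
    return char_offset_adj + str_len_bytes;
}
```

For a first string of 5 code units (10 bytes) passed with an adjustment of 0, the third code unit has offset 4, and under the stated assumption a second string would be passed with an adjustment of 10.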
PDWordFinderGetLatestAlgVersion | () |
Product availability: All |
Platform availability: All |
ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj)
Gets the version number of the specified word finder, or the version number of the latest word finder algorithm.
Parameters
wObj — | IN/OUT The word finder whose algorithm's version is obtained. Pass |
The algorithm version associated with |
See Also
Since
File: PDProcs.h |
Line: 4708 |
PDWordFinderGetNthWord | () |
Product availability: All |
Platform availability: All |
PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh)
Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().
Parameters
wObj — | IN/OUT The word finder whose nth word is obtained. |
|
nTh — | IN/OUT The index of the word to obtain. The first word on a page has an index of zero. Words are counted in PDF order. See the description of the |
The nth word. It returns |
See Also
Since
File: PDProcs.h |
Line: 2192 |
PDWordFinderReleaseWordList | () |
Product availability: All |
Platform availability: All |
void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum)
Releases the word list for a given page. Use this to release a list created by PDWordFinderAcquireWordList() when you are done using this list.
Parameters
wObj — | A word finder object. |
|
pgNum — | The number of the page for which the word list is released. |
See Also
Exceptions
Since
File: PDProcs.h |
Line: 4725 |