Layer | PD_Layer |
Object | PDWord |
A PDWord object represents a word in a PDF file. Each word contains a sequence of characters in one or more styles (see PDStyle).
The characters in a word are not necessarily physically adjacent. For example, words can be hyphenated across line breaks on a page.
Each character in a word has a character type. Character types include: control code, lowercase letter, uppercase letter, digit, punctuation mark, hyphen, soft hyphen, ligature, white space, comma, period, unmapped glyph, end-of-phrase glyph, wildcard, word break, and glyphs that cannot be represented in the destination font encoding.
The PDWordGetCharacterTypes() method gets the character type for each character in a word. The PDWordGetAttr() method returns a mask containing information on the types of characters in a word. The mask is the logical OR of several flags, including one that indicates the word contains a ligature (for example, the fi ligature, which replaces the two-character sequence f followed by i). Ligatures are used to improve the appearance of a word.
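Testing such a mask comes down to a bitwise AND against the flags of interest. The sketch below is illustrative only: the flag names and values (DEMO_HAS_DIGIT and so on) are hypothetical stand-ins, not the SDK's word attribute constants, and the two helper functions are not part of the API. Real code would apply the same tests to the value returned by PDWordGetAttr(), using the constants defined in the SDK headers.

```c
#include <stdint.h>

/* Hypothetical flag values for illustration only. They are NOT the
   SDK's word attribute constants. */
enum {
    DEMO_HAS_DIGIT       = 0x0001,
    DEMO_HAS_PUNCTUATION = 0x0002,
    DEMO_HAS_LIGATURE    = 0x0004
};

/* Returns nonzero if every bit in `flags` is set in `mask`. */
int demo_mask_has_all(uint16_t mask, uint16_t flags)
{
    return (mask & flags) == flags;
}

/* Returns nonzero if at least one bit in `flags` is set in `mask`. */
int demo_mask_has_any(uint16_t mask, uint16_t flags)
{
    return (mask & flags) != 0;
}
```

Because the mask summarizes the whole word, it can say that a ligature is present but not where; per-character information comes from PDWordGetCharacterTypes().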
A word's location is specified by the offset of its first character from the beginning of the page (known as the character offset). The characters are enumerated in the order in which they appear in the page's content stream in the PDF file (which is not necessarily the order in which the characters are read when displayed or printed).
A word also has a character delta, which is the difference between the number of characters representing the word in the PDF file and the number of characters in the word. The character delta is non-zero, for example, when a word contains a ligature.
Typedef | ||
---|---|---|
PDWord
A word in a PDF file. Each word contains a sequence of characters in one or more styles (see PDStyle).
|
Callback | ||
---|---|---|
PDWordProc
A callback for PDWordFinderEnumWords. It is called once for each word.
|
Method | ||
---|---|---|
Creates a text selection object for a given page that includes all words in a word list, as returned from a PDWordFinder method. The text selection can then be set as the current selection using AVDocSetSelection().
|
||
ASBool PDWordFilterString(ASUns16* infoArray, char* cNewWord, char* cOldWord)
Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. It does not remove wildcard characters ('*' and '?') or any punctuation surrounded by alphanumeric characters within the word.
|
||
Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. It does not remove wildcard characters ('*' and '?') or any punctuation surrounded by alphanumeric characters within the word. It also converts ligatures to their constituent characters. The determination of which characters to remove is made by examining the flags in the outEncInfo array passed to PDDocCreateWordFinder(). As a result, this method is most useful for words obtained by calling PDWordFinderGetNthWord(), words passed to the callback for PDWordFinderEnumWords(), and words in the pXYSortTable returned by PDWordFinderAcquireWordList(). See the description of PDWordFilterString() for further information, and for a description of how the two methods differ.
|
||
Copies the text from a word into an ASText object. It automatically performs the necessary encoding conversions from the specified word (either in Unicode or Host Encoding) to the ASText object.
|
||
ASUns16 PDWordGetAttr(PDWord word)
Gets a bit field containing information on the types of characters in a word. Use PDWordGetCharacterTypes() if you wish to check each character's type individually.
|
||
This is a version 6.0 extension of PDWordGetAttr() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher. It can get an additional 16-bit flag group defined in Acrobat 6.
|
||
Returns the byte offset within the specified word of the highlightable character at the specified character offset. The first character of a word is at byte offset 0. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
|
||
Gets the character type for each character in a word.
|
||
ASInt8 PDWordGetCharDelta(PDWord word)
Gets the difference between the word length (the number of printed characters in the word) and the PDF word length (the number of character codes in the word). For instance, if the PDF word is fi (ligature) sh the mapped word will be "fish". The ligature occupies only one character code, so in this case the character delta will be 3-4 = -1.
|
||
Gets the WordFinder Character Encoding Flags for each character in a word, which specify how reliably the word finder identified the character encoding.
|
||
ASUns16 PDWordGetCharOffset(PDWord word)
Returns a word's character offset from the beginning of its page. This information, together with the character delta obtained from PDWordGetCharDelta(), can be used to highlight a range of words on a page, using PDTextSelectCreatePageHilite().
|
||
ASUns32 PDWordGetCharOffsetEx(PDWord word, ASUns32 byteIdx, ASUns32* bytesConsumed, ASUns32* offsetLen)
This is a version 6.0 extension of PDWordGetCharOffset() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
|
||
Gets the quadrilateral bounding the character at a given index position in the word. If the specified character is constructed from multiple bytes, only the first byte yields a valid quad; for the other bytes, this method returns false.
|
||
ASUns8 PDWordGetLength(PDWord word)
Gets the number of bytes in a word. This method also works on non-Roman systems.
|
||
Returns a PDStyle object for the nth style in a word.
|
||
Gets the specified word's nth quad, specified in user space coordinates. See PDWordGetNumQuads() for a description of a quad.
|
||
ASUns32 PDWordGetNumHiliteChar(PDWord word)
Gets the number of highlightable characters in a word. A highlightable character is the minimum text unit that Acrobat can select and highlight. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
|
||
ASInt16 PDWordGetNumQuads(PDWord word)
Gets the number of quads in a word. A quad is a quadrilateral bounding a contiguous piece of a word. Every word has at least one quad. A word has more than one quad, for example, if it is hyphenated and split across multiple lines or if the word is set on a curve rather than on a straight line.
|
||
This method gets a word's text. The returned string includes any word break characters (such as space characters) that follow the word, but not any that precede the word. The characters that are treated as word breaks are defined in the outEncInfo parameter of the PDDocCreateWordFinder() method. Use PDWordFilterString() to subsequently remove the word break characters.
|
||
Gets the locations of style transitions in a word. Every word has at least one style transition, at character position zero in the word.
|
||
Tests whether a word is visible in a given optional-content context on a given page.
|
||
ASBool PDWordIsRotated(PDWord word)
Tests whether a word is rotated.
|
||
Makes a word visible in a given optional-content context on a given page.
|
||
Splits the specified string into words by substituting spaces for word separator characters. The list of characters considered to be word separators can be specified, or a default list can be used.
|
PDWord |
Product availability: All |
Platform availability: All |
typedef struct _t_PDWord* PDWord;
A word in a PDF file. Each word contains a sequence of characters in one or more styles (see PDStyle).
See Also
File: PDExpT.h |
Line: 3303 |
PDWordProc |
Product availability: All |
Platform availability: All |
ASBool (*PDWordProc)(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void *clientData)
A callback for PDWordFinderEnumWords. It is called once for each word.
See Also
File: PDExpT.h |
Line: 3325 |
PDWordCreateTextSelect | () |
Product availability: All |
Platform availability: All |
PDTextSelect PDWordCreateTextSelect(PDPage page, PDWord* wList, ASUns32 wListLen)
Creates a text selection object for a given page that includes all words in a word list, as returned from a PDWordFinder method. The text selection can then be set as the current selection using AVDocSetSelection().
Parameters
page — | The page on which to select the words. |
|
wList — | The word list to be selected. |
|
wListLen — | The number of words in the word list. |
The newly created text selection. |
See Also
Since
File: PDProcs.h |
Line: 8514 |
PDWordFilterString | () |
Product availability: All |
Platform availability: All |
Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. It does not remove wildcard characters ('*' and '?') or any punctuation surrounded by alphanumeric characters within the word.
The determination of which characters are alphanumeric, wildcard, punctuation, and so forth, is made by the values in infoArray.
Although this method seems very similar to PDWordFilterWord(), the two methods treat letters and digits slightly differently. PDWordFilterWord() uses the encoding info array but also does a straight character code test for any characters that have not been mapped to anything. It does this to catch letters and digits from non-standard character sets, and is necessary to avoid removing words with non-standard character sets.
PDWordFilterString(), on the other hand, was designed for known character sets such as WinAnsi and Mac Roman.
For non-Roman character set viewers, this method currently supports only SHIFT-JIS encoding on a Japanese system.
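The trimming rule described above can be sketched in plain C. This is not the SDK's implementation: it classifies characters with the standard ctype functions instead of an infoArray, handles ASCII only (so soft hyphens are not covered), and the names demo_is_trimmable and demo_filter_string, along with the return convention, are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A character is trimmable if it is white space or punctuation,
   except for the wildcards '*' and '?'. */
int demo_is_trimmable(unsigned char c)
{
    if (c == '*' || c == '?')
        return 0;
    return isspace(c) || ispunct(c);
}

/* Copies `in` to `out` with leading and trailing trimmable characters
   removed. Punctuation inside the word is left untouched.
   `out_size` is the capacity of `out` in bytes, including the
   terminating NUL. Returns 1 on success, 0 if `out` is too small. */
int demo_filter_string(const char *in, char *out, size_t out_size)
{
    size_t start = 0;
    size_t end = strlen(in);

    /* Skip trimmable characters at the front. */
    while (start < end && demo_is_trimmable((unsigned char)in[start]))
        start++;
    /* Skip trimmable characters at the back. */
    while (end > start && demo_is_trimmable((unsigned char)in[end - 1]))
        end--;

    if (end - start + 1 > out_size)
        return 0;
    memcpy(out, in + start, end - start);
    out[end - start] = '\0';
    return 1;
}
```

With this sketch, an input of two spaces followed by `"don't,"` and a trailing space filters to `don't`: the quotation marks and the comma are removed, while the apostrophe, which sits between letters, is kept.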
Parameters
infoArray — | An array specifying the type of each character in the font. Each entry in this table must be one of the Character Type Codes. If |
|
cNewWord — | (Filled by the method) The filtered word. |
|
cOldWord — | The unfiltered word. This value must be passed to the method. |
|
See Also
Since
File: PDProcs.h |
Line: 5045 |
PDWordFilterWord | () |
Product availability: All |
Platform availability: All |
Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. It does not remove wildcard characters ('*' and '?') or any punctuation surrounded by alphanumeric characters within the word. It also converts ligatures to their constituent characters. The determination of which characters to remove is made by examining the flags in the outEncInfo array passed to PDDocCreateWordFinder(). As a result, this method is most useful for words obtained by calling PDWordFinderGetNthWord(), words passed to the callback for PDWordFinderEnumWords(), and words in the pXYSortTable returned by PDWordFinderAcquireWordList(). See the description of PDWordFilterString() for further information, and for a description of how the two methods differ.
The Acrobat Catalog program uses this method to filter words before indexing them.
This method works with non-Roman systems.
Parameters
word — | ||
buffer — | (Filled by the method) The filtered string. |
|
bufferLen — | The maximum number of characters that |
|
newLen — | (Filled by the method) The number of characters actually written into |
|
See Also
Since
File: PDProcs.h |
Line: 5081 |
PDWordGetASText | () |
Product availability: All |
Platform availability: All |
Copies the text from a word into an ASText object. It automatically performs the necessary encoding conversions from the specified word (either in Unicode or Host Encoding) to the ASText object.
Parameters
word — | ||
filter — | Character types to be dropped from the output string. For example, the following returns text without soft hyphens and accent marks:
|
|
str — | An existing ASText object whose content will be replaced by the new text. |
Since
File: PDProcs.h |
Line: 8431 |
PDWordGetAttr | () |
Product availability: All |
Platform availability: All |
Gets a bit field containing information on the types of characters in a word. Use PDWordGetCharacterTypes() if you wish to check each character's type individually.
Parameters
word — | IN/OUT The word whose character types are obtained. |
A bit field containing information on the types of characters in word. The value is a logical |
See Also
Since
File: PDProcs.h |
Line: 4837 |
PDWordGetAttrEx | () |
Product availability: All |
Platform availability: All |
This is a version 6.0 extension of PDWordGetAttr() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher. It can get an additional 16-bit flag group defined in Acrobat 6.
It gets a bit field containing information on the types of characters in a word. Use PDWordGetCharacterTypes() if you wish to check each character's type individually.
Parameters
word — | The word whose character types are obtained. |
|
groupID — | The group number of the Word Attributes flags:
|
A bit field containing information on the types of characters in |
See Also
Since
File: PDProcs.h |
Line: 8489 |
PDWordGetByteIdxFromHiliteChar | () |
Product availability: All |
Platform availability: All |
Returns the byte offset within the specified word of the highlightable character at the specified character offset. The first character of a word is at byte offset 0. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
The returned byte offset can be passed to PDWordGetCharOffsetEx() and PDWordGetCharQuad() to get additional information. Use PDWordGetNumHiliteChar() to get the number of highlightable characters in a word.
Parameters
word — | The word containing the character. |
|
charIdx — | The character index within the word. |
The byte offset of the specified character within the word, or |
See Also
Since
File: PDProcs.h |
Line: 8414 |
PDWordGetCharacterTypes | () |
Product availability: All |
Platform availability: All |
Gets the character type for each character in a word.
Parameters
word — | The word whose character types are obtained. |
|
cArr — | (Filled by the method) An array of character types. This array contains one element for each character in the word. Use PDWordGetLength() to determine the number of elements that must be in the array. Each element is the logical |
|
size — | The number of elements in |
See Also
Since
File: PDProcs.h |
Line: 4859 |
PDWordGetCharDelta | () |
Product availability: All |
Platform availability: All |
Gets the difference between the word length (the number of printed characters in the word) and the PDF word length (the number of character codes in the word). For instance, if the PDF word is fi (ligature) sh, the mapped word will be "fish". The ligature occupies only one character code, so in this case the character delta will be 3-4 = -1.
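The arithmetic in this example can be written out as a small, self-contained sketch. The DemoCode type and demo_char_delta function are hypothetical illustrations and not part of the SDK; they simply model each character code in the PDF content together with the number of printed characters it maps to.

```c
#include <stddef.h>

/* One character code as it appears in the PDF content, together with
   the number of printed characters it maps to (2 for an fi ligature). */
typedef struct {
    const char *label;
    int printed_chars;
} DemoCode;

/* Character delta as in the example above: the number of character
   codes minus the number of printed characters. */
int demo_char_delta(const DemoCode *codes, size_t n_codes)
{
    int printed = 0;
    size_t i;

    for (i = 0; i < n_codes; i++)
        printed += codes[i].printed_chars;
    return (int)n_codes - printed;
}
```

For the codes fi (ligature), s, and h, the sketch counts 3 codes and 4 printed characters, which gives -1. A word without ligatures gives 0.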
Parameters
word — | IN/OUT The word whose character delta is obtained. |
The character delta for word. Cast the return value to an ASInt8 before using. If the PDWord's character set has no ligatures, such as on a non-Roman viewer supporting Japanese, returns |
See Also
Since
File: PDProcs.h |
Line: 4898 |
PDWordGetCharEncFlags | () |
Product availability: All |
Platform availability: All |
Gets the WordFinder Character Encoding Flags for each character in a word, which specify how reliably the word finder identified the character encoding.
This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
Parameters
word — | ||
fList — | (Filled by the method) An array of character encoding flags types. This array contains one element for each byte of text in the word. The byte length of the text can be determined with PDWordGetLength(). Each element is the logical |
|
size — | The maximum number of elements in the array |
See Also
Since
File: PDProcs.h |
Line: 8456 |
PDWordGetCharOffset | () |
Product availability: All |
Platform availability: All |
Returns a word's character offset from the beginning of its page. This information, together with the character delta obtained from PDWordGetCharDelta(), can be used to highlight a range of words on a page, using PDTextSelectCreatePageHilite().
Parameters
word — | IN/OUT The word whose character offset is obtained. |
The word's character offset. On multi-byte systems, it points to the first byte. |
See Also
Since
File: PDProcs.h |
Line: 4877 |
PDWordGetCharOffsetEx | () |
Product availability: All |
Platform availability: All |
ASUns32 PDWordGetCharOffsetEx(PDWord word, ASUns32 byteIdx, ASUns32* bytesConsumed, ASUns32* offsetLen)
This is a version 6.0 extension of PDWordGetCharOffset() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
It returns the character offset for a character identified by its index number, and the number of bytes (length) used for that character. The length is usually 1 for single-byte characters and 2 for double-byte characters. If multiple bytes are used to construct one character, only the first byte has valid character offset information; the other bytes have a zero offset length and the same character offset as the first byte. A returned offset length of zero therefore means that the specified byte is part of a multi-byte character but is not its first byte.
The character offset is the character position calculated in bytes from the beginning of a page. Because of the encoding conversions and character replacements applied by the word finder, some characters may have different byte lengths from the original PDF content. The character offset itself can locate a character in the PDF content. However, without the offset length (that is, the number of bytes in the PDF content), clients cannot tell whether two characters are next to each other in the PDF content. For example, suppose you want to create a text selection object for two characters at character offsets 1 and 3. You can create an object with two disconnected ranges, [offset 1, length 1] and [offset 3, length 1]. However, if you know that the offset length of both characters is 2, you can create a simpler object with a single range, [offset 1, length 4].
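The range example can be sketched as a merge over (offset, length) pairs. The DemoRange type and demo_merge_ranges function are hypothetical and not part of the SDK; the sketch assumes the pairs have already been collected (for instance from PDWordGetCharOffsetEx()) and are sorted by offset.

```c
#include <stddef.h>

typedef struct {
    unsigned offset; /* character offset from the start of the page, in bytes */
    unsigned length; /* offset length: bytes occupied in the PDF content */
} DemoRange;

/* Merges ranges that touch (the next range starts exactly where the
   previous one ends) in place. `ranges` must be sorted by offset.
   Returns the number of ranges left after merging. */
size_t demo_merge_ranges(DemoRange *ranges, size_t n)
{
    size_t out = 0;
    size_t i;

    if (n == 0)
        return 0;
    for (i = 1; i < n; i++) {
        if (ranges[i].offset == ranges[out].offset + ranges[out].length) {
            /* Adjacent in the PDF content: extend the current range. */
            ranges[out].length += ranges[i].length;
        } else {
            /* Gap: start a new range. */
            out++;
            ranges[out] = ranges[i];
        }
    }
    return out + 1;
}
```

With offset lengths of 2, the ranges at offsets 1 and 3 merge into a single range of offset 1 and length 4. With offset lengths of 1, they stay separate because the byte at offset 2 lies between them.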
Parameters
word — | The word whose character offset is obtained. |
|
byteIdx — | The byte index within the word of the character whose offset is obtained. Valid values are |
|
bytesConsumed — | (Filled by method) Returns the number of bytes in the word that are occupied by the specified character. It can be |
|
offsetLen — | (Filled by the method) Returns the number of bytes occupied by the specified character in the original PDF content. This is |
The word's character offset and the number of bytes occupied by the character. |
See Also
Since
File: PDProcs.h |
Line: 8342 |
PDWordGetCharQuad | () |
Product availability: All |
Platform availability: All |
ASBool PDWordGetCharQuad(PDWord word, ASUns32 byteIdx, ASFixedQuad* quad)
Gets the quadrilateral bounding the character at a given index position in the word. If the specified character is constructed from multiple bytes, only the first byte yields a valid quad; for the other bytes, this method returns false.
This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
Parameters
word — | The word containing the character whose quad is obtained. |
|
byteIdx — | The byte index within the word of the character whose quad is obtained. Valid values are |
|
quad — | (Filled by method) A pointer to an existing quad structure in which to return the character's quad specified in user-space coordinates. |
|
See Also
Since
File: PDProcs.h |
Line: 8368 |
PDWordGetLength | () |
Product availability: All |
Platform availability: All |
Gets the number of bytes in a word. This method also works on non-Roman systems.
Parameters
word — | IN/OUT The word object whose character count is obtained. |
The number of characters in word. |
See Also
Since
File: PDProcs.h |
Line: 4787 |
PDWordGetNthCharStyle | () |
Product availability: All |
Platform availability: All |
PDStyle PDWordGetNthCharStyle(PDWordFinder wObj, PDWord word, ASInt32 dex)
Returns a PDStyle object for the nth style in a word.
Parameters
wObj — | IN/OUT A word finder object. |
|
word — | IN/OUT The word whose nth style is obtained. |
|
dex — | IN/OUT The index of the style to obtain. The first style in a word has an index of zero. |
The nth style in the word. It returns |
See Also
Exceptions
dex < 0.
Since
File: PDProcs.h |
Line: 4934 |
PDWordGetNthQuad | () |
Product availability: All |
Platform availability: All |
ASBool PDWordGetNthQuad(PDWord word, ASInt16 nTh, ASFixedQuad* quad)
Gets the specified word's nth quad, specified in user space coordinates. See PDWordGetNumQuads() for a description of a quad.
The quad's height is the height of the font's bounding box, not the height of the tallest character used in the word. The font's bounding box is determined by the glyphs in the font that extend farthest above and below the baseline; it often extends somewhat above the top of 'A' and below the bottom of 'y'.
The quad's width is determined from the characters actually present in the word.
For example, the quads for the words "AWAY" and "away" have the same height, but generally do not have the same width unless the font is a mono-spaced font (a font in which all characters have the same width).
Despite the names of the fields in an ASFixedQuad (tl for top left, bl for bottom left, and so forth), the corners of quad do not necessarily have these positions.
Parameters
word — | The word whose nth quad is obtained. |
|
nTh — | The quad to obtain. A word's first quad has an index of zero. |
|
quad — | (Filled by the method) A pointer to the word's nth quad, specified in user-space coordinates. |
|
See Also
Since
File: PDProcs.h |
Line: 4984 |
PDWordGetNumHiliteChar | () |
Product availability: All |
Platform availability: All |
Gets the number of highlightable characters in a word. A highlightable character is the minimum text unit that Acrobat can select and highlight. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
Because of the encoding conversion, the characters in a word finder word list do not have a 1-to-1 correspondence to the characters displayed by Acrobat. For example, if the word is "fish" and the text operation in the PDF content is "fi" (ligature) + 's' + 'h', this method returns the number of highlightable characters as 3, counting "fi" as one character. For the same word, the PDWordGetLength() method returns the byte length as 4.
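The relationship between highlightable characters and bytes can be modelled with a small sketch. It is not SDK code: the word is represented by a hypothetical array giving the number of bytes each highlightable character occupies in the word's text, and the function names and the out-of-range return value are assumptions made for illustration.

```c
#include <stddef.h>

/* Byte length of the word: the sum of the bytes occupied by each
   highlightable character. */
size_t demo_word_byte_length(const unsigned char *bytes_per_char,
                             size_t n_chars)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < n_chars; i++)
        total += bytes_per_char[i];
    return total;
}

/* Byte index at which highlightable character `char_idx` starts, or
   (size_t)-1 if `char_idx` is out of range (an assumed convention). */
size_t demo_byte_idx_from_hilite_char(const unsigned char *bytes_per_char,
                                      size_t n_chars, size_t char_idx)
{
    size_t byte_idx = 0;
    size_t i;

    if (char_idx >= n_chars)
        return (size_t)-1;
    for (i = 0; i < char_idx; i++)
        byte_idx += bytes_per_char[i];
    return byte_idx;
}
```

For "fish" the array is {2, 1, 1}: three highlightable characters, a byte length of 4, and starting byte indexes of 0, 2, and 3. This mirrors the kind of mapping that PDWordGetByteIdxFromHiliteChar() provides for a real word.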
Parameters
word — | The word whose highlightable character count is obtained. |
The number of highlightable characters in |
See Also
Since
File: PDProcs.h |
Line: 8391 |
PDWordGetNumQuads | () |
Product availability: All |
Platform availability: All |
Gets the number of quads in a word. A quad is a quadrilateral bounding a contiguous piece of a word. Every word has at least one quad. A word has more than one quad, for example, if it is hyphenated and split across multiple lines or if the word is set on a curve rather than on a straight line.
Parameters
word — | IN/OUT The word whose quad count is obtained. |
The number of quads in word. |
See Also
Since
File: PDProcs.h |
Line: 4949 |
PDWordGetString | () |
Product availability: All |
Platform availability: All |
This method gets a word's text. The returned string includes any word break characters (such as space characters) that follow the word, but not any that precede the word. The characters that are treated as word breaks are defined in the outEncInfo parameter of the PDDocCreateWordFinder() method. Use PDWordFilterString() to subsequently remove the word break characters.
This method produces a string in whatever encoding the PDWord uses, for both Roman and non-Roman systems.
Parameters
word — | The word whose string is obtained. |
|
str — | (Filled by the method) The string. The encoding of the string is the encoding used by the |
|
len — | The length of |
See Also
Exceptions
Since
File: PDProcs.h |
Line: 4816 |
PDWordGetStyleTransition | () |
Product availability: All |
Platform availability: All |
Gets the locations of style transitions in a word. Every word has at least one style transition, at character position zero in the word.
Parameters
word — | ||
transTbl — | IN/OUT (Filled by the method) An array of style transitions. Each element is the character offset in word where the style changes. The offset specifies the first character in the word that has the new style. The first character in a word has an offset of zero. |
|
size — | IN/OUT The number of entries that |
The number of style transition offsets copied to |
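A table of transition offsets can be turned into style runs, as in the sketch below. The DemoStyleRun type and demo_style_runs function are hypothetical and not part of the SDK; the sketch assumes the offsets are in ascending order and that the word length is known.

```c
#include <stddef.h>

/* A run of characters sharing one style: [start, end). */
typedef struct {
    int start;
    int end;
} DemoStyleRun;

/* Converts `n_trans` transition offsets into runs. Each run begins at
   a transition and ends at the next one, or at `word_len` for the
   last run. `runs` must have room for `n_trans` entries.
   Returns the number of runs written. */
size_t demo_style_runs(const int *trans, size_t n_trans, int word_len,
                       DemoStyleRun *runs)
{
    size_t i;

    for (i = 0; i < n_trans; i++) {
        runs[i].start = trans[i];
        runs[i].end = (i + 1 < n_trans) ? trans[i + 1] : word_len;
    }
    return n_trans;
}
```

For a five-character word with transitions at 0 and 3, the sketch yields the runs [0, 3) and [3, 5). Since every word has a transition at position zero, there is always at least one run.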
See Also
Since
File: PDProcs.h |
Line: 4919 |
PDWordIsCurrentlyVisible | () |
Product availability: All |
Platform availability: All |
ASBool PDWordIsCurrentlyVisible(PDWord word, ASInt32 pageNum, PDOCContext ctx)
Tests whether a word is visible in a given optional-content context on a given page.
Parameters
word — | The word to test. |
|
pageNum — | The page number for which the word is tested. |
|
ctx — | The context in which the word is tested, as returned by |
|
See Also
Since
File: PDProcs.h |
Line: 10523 |
PDWordIsRotated | () |
Product availability: All |
Platform availability: All |
Tests whether a word is rotated.
Parameters
word — | The word to test. |
|
See Also
Since
File: PDProcs.h |
Line: 4993 |
PDWordMakeVisible | () |
Product availability: All |
Platform availability: All |
ASBool PDWordMakeVisible(PDWord word, ASInt32 pageNum, PDOCContext ctx)
Makes a word visible in a given optional-content context on a given page.
Parameters
word — | The word to make visible. |
|
pageNum — | ||
ctx — | The context in which the word is to be made visible, as returned by |
|
See Also
Since
File: PDProcs.h |
Line: 10540 |
PDWordSplitString | () |
Product availability: All |
Platform availability: All |
Splits the specified string into words by substituting spaces for word separator characters. The list of characters considered to be word separators can be specified, or a default list can be used.
The characters ',' and '.' are context-sensitive word separators. If surrounded by digits (for example, 654,096.345), they are not considered word separators.
For non-Roman character set viewers, this method currently supports only SHIFT-JIS encoding on a Japanese system.
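The context-sensitive rule for ',' and '.' can be sketched as follows. This is not the SDK's implementation: it takes the separators as a plain string rather than a character information table, handles ASCII only, and the name demo_split_string and its return convention (the count of replaced characters, or -1 if the output buffer is too small) are assumptions.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies `in` to `out`, replacing each character found in `separators`
   with a space. A ',' or '.' directly between two digits is kept.
   `out_size` is the capacity of `out` including the terminating NUL.
   Returns the number of characters replaced, or -1 if `out` is too
   small. */
int demo_split_string(const char *in, const char *separators,
                      char *out, size_t out_size)
{
    size_t len = strlen(in);
    size_t i;
    int splits = 0;

    if (len + 1 > out_size)
        return -1;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        int is_sep = strchr(separators, c) != NULL;

        /* ',' and '.' surrounded by digits are not separators. */
        if (is_sep && (c == ',' || c == '.') && i > 0 && i + 1 < len
            && isdigit((unsigned char)in[i - 1])
            && isdigit((unsigned char)in[i + 1]))
            is_sep = 0;

        if (is_sep) {
            out[i] = ' ';
            splits++;
        } else {
            out[i] = (char)c;
        }
    }
    out[len] = '\0';
    return splits;
}
```

With the separator list ",./", the input 654,096.345 passes through unchanged, while a,b/c becomes "a b c" with two replacements.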
Parameters
infoArray — | A character information table. It specifies each character's type; word separator characters must be marked as |
|
cNewWord — | (Filled by the method) The word that has been split. Word separator characters have been replaced with spaces. |
|
cOldWord — | The word to split. |
|
nMaxLen — | The number of characters that |
The number of splits that occurred. |
See Also
Exceptions
Since
File: PDProcs.h |
Line: 2227 |