Reading PDF Files Through the DOM Interface

Acrobat 6.0 and later defines a document object model (DOM) that provides more complete access to the document structure than the MSAA interface. The Accessibility plug-in defines and exports six COM interfaces in AcrobatAccess.lib that expose Acrobat’s document hierarchy:

  • IPDDomNode defines methods that apply to all elements of the document hierarchy.

  • IPDDomDocument interface is exported by the root object for the page or document.

  • IPDDomNodeExt interface is exported by every object that exports IPDDomNode.

  • IPDDomElement defines additional methods that apply only to structure elements.

  • IPDDomWord defines additional methods that apply only to individual words in the document.

  • IPDDomGroupInfo defines an additional method that applies to radio buttons, list boxes, and combo boxes.

Clients of these interfaces must include the files AcrobatAccess.h, AcrobatAccess_i.c, and IPDDom.h.

IPDDomNode data types

This section describes the data types for the PDF DOM hierarchy.

CPDDomNodeType

Defines the type of a node in the PDF DOM hierarchy returned by GetType.

typedef enum {
   CPDDomNode_Document = 1,
   CPDDomNode_Page = 2,
   CPDDomNode_StructElement = 3,
   CPDDomNode_Text = 4,
   CPDDomNode_Word = 5,
   CPDDomNode_Char = 6,
   CPDDomNode_Graphic = 7,
   CPDDomNode_Link = 8,
   CPDDomNode_PushButtonField = 9,
   CPDDomNode_TextEditField =10,
   CPDDomNode_StaticTextField =11,
   CPDDomNode_ListboxField =12,
   CPDDomNode_ComboboxField =13,
   CPDDomNode_CheckboxField =14,
   CPDDomNode_RadioButtonField =15,
   CPDDomNode_SignatureField =16,
   CPDDomNode_OtherField =17,
   CPDDomNode_Comment =18,
   CPDDomNode_TextComment =19,
   CPDDomNode_Other =20,
   CPDDomNode_LineSeg =21,
   CPDDomNode_WordSeg =22
} CPDDomNodeType;
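
As a rough illustration of how a client might use these values, the sketch below tests whether a node type falls in the block of form-field types (9 through 17 in the listing above). It declares only the two boundary values locally so that it compiles without the SDK headers; the helper name IsFormFieldNode is hypothetical and is not part of the SDK.

```cpp
#include <cassert>

// Local copies of two CPDDomNodeType values from the listing above, declared
// here only so that the sketch compiles without the SDK headers.
enum {
    CPDDomNode_PushButtonField = 9,   // first form-field type in the listing
    CPDDomNode_OtherField = 17        // last form-field type in the listing
};

// Hypothetical helper (not part of the SDK): returns true when the value
// written by GetType lies in the contiguous range of form-field types.
bool IsFormFieldNode(long nodeType) {
    return nodeType >= CPDDomNode_PushButtonField &&
           nodeType <= CPDDomNode_OtherField;
}
```

A range check like this relies on the form-field values staying contiguous; a switch over the individual constants would be the more defensive choice.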

PDDom_FontStyle

Constants for font styles returned by GetFontInfo.

typedef enum {
   PDDOM_FONTATTR_ITALIC = 0x1,
   PDDOM_FONTATTR_SMALLCAP = 0x2,
   PDDOM_FONTATTR_ALLCAP = 0x4,
   PDDOM_FONTATTR_SCRIPT = 0x8,
   PDDOM_FONTATTR_BOLD = 0x10,
   PDDOM_FONTATTR_LIGHT = 0x20
} PDDOM_FontStyle;
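
Because these constants are bit flags, the fontAttr value can carry several of them at once. The sketch below shows one way to decode such a value. It repeats the enumeration locally so that it compiles on its own, and the helper FontStyleNames and its lowercase labels are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Local copy of the font style flags from the listing above.
enum PDDOM_FontStyle {
    PDDOM_FONTATTR_ITALIC = 0x1,
    PDDOM_FONTATTR_SMALLCAP = 0x2,
    PDDOM_FONTATTR_ALLCAP = 0x4,
    PDDOM_FONTATTR_SCRIPT = 0x8,
    PDDOM_FONTATTR_BOLD = 0x10,
    PDDOM_FONTATTR_LIGHT = 0x20
};

// Hypothetical helper: returns a label for every style flag set in fontAttr,
// in ascending flag order.
std::vector<std::string> FontStyleNames(long fontAttr) {
    static const struct { long flag; const char* name; } kFlags[] = {
        {PDDOM_FONTATTR_ITALIC, "italic"},
        {PDDOM_FONTATTR_SMALLCAP, "smallcap"},
        {PDDOM_FONTATTR_ALLCAP, "allcap"},
        {PDDOM_FONTATTR_SCRIPT, "script"},
        {PDDOM_FONTATTR_BOLD, "bold"},
        {PDDOM_FONTATTR_LIGHT, "light"},
    };
    std::vector<std::string> names;
    for (const auto& entry : kFlags) {
        if (fontAttr & entry.flag) {
            names.push_back(entry.name);
        }
    }
    return names;
}
```

For example, a fontAttr value of 0x11 has both the italic and bold bits set.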

FontInfoState

Constants for font status returned by GetFontInfo.

typedef enum {
   FontInfo_Unchecked =1,
   FontInfo_NoInfo =2,
   FontInfo_MixedInfo =3,
   FontInfo_Valid =4
} FontInfoState;

DocState

Constants for document status returned by GetDocInfo in the IPDDomDocument interface.

enum DocState {
   DocState_OK =0,
   DocState_Protected =1,
   DocState_Empty =2,
   DocState_Unavailable =3
};

NodeRelationship

Constants returned by Relationship in the IPDDomNodeExt interface.

enum NodeRelationship {
   NodeRelationship_Descendant =0,
   NodeRelationship_Ancestor =1,
   NodeRelationship_Before =2,
   NodeRelationship_After =3,
   NodeRelationship_Equal =4,
   NodeRelationship_None =5
};

IPDDomNode methods

IPDDomNode defines methods that apply to all elements of the document hierarchy.

Words and lines in text

An IPDDomNode that represents a text node has the role CPDDomNode_Text. By default, the children of text nodes are word nodes. To get the word children of a text node, call the IPDDomNode method GetChild. An IPDDomNode that represents a word has the role CPDDomNode_Word.

Note

When a word is hyphenated and thus appears on two lines, each segment of the word is returned as a child that has the role CPDDomNode_WordSeg.

Text can also be thought of as having lines as children. To get the line children of a text node, call the IPDDomNode method GetTextInLines. This method returns a new object for the text node. Subsequently, calling GetChild on this object returns lines as children. An IPDDomNode that represents a line has the role CPDDomNode_LineSeg. The children of that line node are the words in that line.
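
The traversal described above can be sketched without COM by using a small stand-in type. MockTextNode below is hypothetical: it only imitates the GetChildCount and GetChild pattern with plain C++ members, whereas a real client would call the COM methods and check their return codes. The sketch collects the values of the word and word-segment children of a text node.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local copies of the node types used here, from the CPDDomNodeType listing.
enum {
    CPDDomNode_Text = 4,
    CPDDomNode_Word = 5,
    CPDDomNode_WordSeg = 22
};

// Hypothetical stand-in for an IPDDomNode; not part of the SDK.
struct MockTextNode {
    long type;                            // what GetType would report
    std::string value;                    // what GetValue would report
    std::vector<MockTextNode> children;

    long GetChildCount() const { return static_cast<long>(children.size()); }

    // Returns nullptr when there is no child at the given position.
    const MockTextNode* GetChild(long index) const {
        if (index < 0 || index >= GetChildCount()) return nullptr;
        return &children[static_cast<std::size_t>(index)];
    }
};

// Collects the values of all word and word-segment children of a text node.
// Returns an empty list if the node is not a text node.
std::vector<std::string> CollectWords(const MockTextNode& textNode) {
    std::vector<std::string> words;
    if (textNode.type != CPDDomNode_Text) return words;
    for (long i = 0; i < textNode.GetChildCount(); ++i) {
        const MockTextNode* child = textNode.GetChild(i);
        if (child != nullptr && (child->type == CPDDomNode_Word ||
                                 child->type == CPDDomNode_WordSeg)) {
            words.push_back(child->value);
        }
    }
    return words;
}
```

The same loop shape applies to the object returned by GetTextInLines, except that its children are line nodes whose own children are the words.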

GetParent

ppDispParent returns the IPDDomNode for the parent of this element if there is a parent element in the DOM hierarchy, or NULL if this element is the root element of the hierarchy.

LRESULT GetParent (IDispatch **ppDispParent)

GetType

nodeType returns the CPDDomNodeType of this element.

LRESULT GetType (long *nodeType)

GetChild

ppDispChild returns the IPDDomNode for the child of this element at position index, or NULL if there is no child at position index.

For a text node, this returns child words; see Words and lines in text.

LRESULT GetChild (ASInt32 index, IDispatch **ppDispChild)

GetChildCount

pCountChildren returns the number of children of this element.

LRESULT GetChildCount (long *pCountChildren)

GetName

pszName returns the name of this element.

  • For individual words, this is NULL.

  • For form fields, it is the short description associated with the field.

  • For comments, it is a combination of the comment type and subject (if any).

LRESULT GetName (BSTR *pszName)

GetValue

pszValue returns the value of this element.

  • For individual words, this is the word itself.

  • For form fields, it is the current text content of the field.

  • For links, it is a description of the associated action.

  • For comments, it is the contents.

  • For a signature field, it is the name of the signer and the date signed.

LRESULT GetValue (BSTR *pszValue)

IsSame

If pNode refers to the same node as this element, isSame returns true.

LRESULT IsSame (IPDDomNode *pNode, BOOL *isSame)

GetTextContent

pszText returns the value of all text in the document subtree rooted at this element. Alternate text, actual text, and expansion attributes are included and may override text within the document.

LRESULT GetTextContent (BSTR *pszText)

GetFontInfo

These values describe the font characteristics for the text content of this element.

  • fontStatus returns a value of type FontInfoState.

    • If the value is FontInfo_NoInfo, the text is not rendered, so it has no font characteristics. For example, alternate text has no font characteristics.

    • If the value is FontInfo_Valid, the rest of the values describe the font characteristics for all of the text in the subtree. That is, each word of the text either has these characteristics or has no font characteristics.

    • If the value is FontInfo_MixedInfo, different words of the text have different font characteristics, and the document subtree must be examined more closely to determine which text has which font characteristics.

  • pszName returns the name of the font.

  • fontSize returns the point size.

  • fontAttr returns the set of PDDom_FontStyle values.

  • red, green, and blue return the RGB components of the color of the text. Each component is a value between 0 and 1.

LRESULT GetFontInfo (long* fontStatus, BSTR* pszName, float* fontSize, long* fontAttr, float* red, float* green, float* blue)
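
A client typically checks fontStatus before trusting the other outputs, and often needs the color in 8-bit form. The sketch below illustrates both steps under stated assumptions: the enumeration is copied locally from FontInfoState, and the helpers FontInfoIsUsable and ColorToHex are hypothetical names, not SDK functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Local copy of FontInfoState so that the sketch compiles on its own.
enum FontInfoState {
    FontInfo_Unchecked = 1,
    FontInfo_NoInfo = 2,
    FontInfo_MixedInfo = 3,
    FontInfo_Valid = 4
};

// Hypothetical helper: the name, size, attribute, and color outputs describe
// the whole subtree only when the status is FontInfo_Valid.
bool FontInfoIsUsable(long fontStatus) {
    return fontStatus == FontInfo_Valid;
}

// Hypothetical helper: converts color components in the range 0..1 into a
// "#rrggbb" string, clamping out-of-range input.
std::string ColorToHex(float red, float green, float blue) {
    auto toByte = [](float component) {
        if (component < 0.0f) component = 0.0f;
        if (component > 1.0f) component = 1.0f;
        return static_cast<int>(std::lround(component * 255.0f));
    };
    char buffer[8];  // '#' + six hex digits + terminating NUL
    std::snprintf(buffer, sizeof buffer, "#%02x%02x%02x",
                  toByte(red), toByte(green), toByte(blue));
    return buffer;
}
```

When the status is FontInfo_MixedInfo, the same check can be repeated on each child until a subtree with uniform font characteristics is reached.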

GetLocation

Returns the screen coordinates of the upper left corner, width, and height of the content of the element. Note that this is not exactly the same as the bounding box. If the element spans multiple pages, this method returns only the location on the first visible page. If none of the element’s contents are visible, this method returns an empty location.

LRESULT GetLocation (long *pxLeft, long *pyTop, long *pcxWidth, long *pcyHeight)

GetFromID

ppDispNode returns the IPDDomNode for the element in the same document with the matching ID attribute, or NULL if there is no such element.

The id value is not the same as the UID returned by IAccID in the MSAA interface; it is an optional attribute of the PDF file itself, as returned by GetID in IPDDomElement.

LRESULT GetFromID (BSTR id, IDispatch **ppDispNode)

GetIAccessible

Returns the MSAA IAccessible element corresponding to this element. (Acrobat exports an MSAA interface to the document, as well as a DOM interface.)

Not all DOM elements have corresponding MSAA elements, because the DOM tree breaks the content down into much smaller pieces. If ppIAccessible is NULL, search for an ancestor with a non-NULL value for GetIAccessible to find the corresponding MSAA interface.

Use the method get_PDDomNode to find the IPDDomNode corresponding to a PDF document IAccessible object.

LRESULT GetIAccessible (IDispatch **ppIAccessible)
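
The ancestor search described above amounts to walking up the parent chain until a node reports an MSAA element. The following sketch shows that loop with a hypothetical stand-in type, MockAccNode, whose two fields play the roles of the GetParent and GetIAccessible results; a real client would call those COM methods and release the interfaces it obtains.

```cpp
#include <cassert>

// Hypothetical stand-in for a DOM node; not part of the SDK.
struct MockAccNode {
    const MockAccNode* parent;   // role of GetParent; nullptr at the root
    const void* accessible;      // role of GetIAccessible; nullptr if none
};

// Returns the MSAA element of the node itself or of its nearest ancestor
// that has one, or nullptr if no node on the path to the root has one.
const void* FindNearestAccessible(const MockAccNode* node) {
    for (; node != nullptr; node = node->parent) {
        if (node->accessible != nullptr) {
            return node->accessible;
        }
    }
    return nullptr;
}
```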

ScrollTo

Makes the contents of the node visible. If the contents cover more than one page, only the contents on the first page are made visible. If the entire contents do not fit, the upper left portion is shown.

LRESULT ScrollTo()

GetTextInLines

ppDispTextLines returns an IPDDomNode whose children (obtained by calling GetChild) have the role CPDDomNode_LineSeg; see Words and lines in text.

visibleOnly controls whether the children include only lines that contain at least some visible text.

If the role of the node is not CPDDomNode_Text, this method returns E_FAIL.

LRESULT GetTextInLines (BOOL visibleOnly, IDispatch** ppDispTextLines)

IPDDomNodeExt methods

The IPDDomNodeExt interface is exported by every object that exports IPDDomNode. For Acrobat 7.0 and later, the following methods are available from all objects.

ScrollToEx

Determines where to scroll when the item is too large to fit in the window. If both parameters are true, this method is equivalent to ScrollTo. This method is defined in the IPDDomNodeExt interface on any node.

HRESULT ScrollToEx(
BOOL favorLeft,
BOOL favorTop);

SetFocus

Sets the focus to this node, if it can take focus. This method is defined in the IPDDomNodeExt interface on any node.

HRESULT SetFocus();

GetState

Returns a set of state flags identical to those returned by get_accState for the corresponding IAccessible object. This method is defined in the IPDDomNodeExt interface on any node.

HRESULT GetState(
long* state);

GetIndex

Returns the child index of this node in its parent. The root node returns -1. This method is defined in the IPDDomNodeExt interface on any node.

HRESULT GetIndex(
long* pIndex);

GetPageNum

Returns the first and last pages on which the node appears. This method is defined in the IPDDomNodeExt interface on any node.

HRESULT GetPageNum(
long* firstPage,
long* lastPage);

DoDefaultAction

Executes the default action for a node. This method is defined in the IPDDomNodeExt interface on any node.

HRESULT DoDefaultAction();

Relationship

Returns the relationship of the node parameter to this node. The value is of type NodeRelationship, defined in IPDDom.h. This method is defined in the IPDDomNodeExt interface on any node.

HRESULT Relationship(
IPDDomNode* node,
long* pRel);
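
Since pRel is returned as a plain long, a client may want to map it back to a readable label for logging or debugging. The sketch below is one way to do that; it repeats the NodeRelationship enumeration locally so that it compiles without IPDDom.h, and DescribeRelationship is a hypothetical helper name.

```cpp
#include <cassert>
#include <string>

// Local copy of NodeRelationship so that the sketch compiles on its own.
enum NodeRelationship {
    NodeRelationship_Descendant = 0,
    NodeRelationship_Ancestor = 1,
    NodeRelationship_Before = 2,
    NodeRelationship_After = 3,
    NodeRelationship_Equal = 4,
    NodeRelationship_None = 5
};

// Hypothetical helper: maps the value written to pRel to a short label.
// Values outside the enumeration are reported as "unknown".
std::string DescribeRelationship(long rel) {
    switch (rel) {
        case NodeRelationship_Descendant: return "descendant";
        case NodeRelationship_Ancestor:   return "ancestor";
        case NodeRelationship_Before:     return "before";
        case NodeRelationship_After:      return "after";
        case NodeRelationship_Equal:      return "equal";
        case NodeRelationship_None:       return "none";
        default:                          return "unknown";
    }
}
```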

IPDDomDocument methods

The root object for the page or document exports the IPDDomDocument interface. For Acrobat 7.0 and later, the following methods are available from the root object.

SetCaret

Sets the caret to the specified index in the word. If the index is 0, the caret is placed at the beginning of the word.

HRESULT SetCaret(
IPDDomWord* node,
long index);

GetCaret

Returns the screen location of the caret, the node containing the caret, and the zero-based index of the caret within the node. The node may be a word node or a form field. If there is no active caret, the call returns S_FALSE.

HRESULT GetCaret(
long* pxLeft,
long* pyTop,
long* pcxWidth,
long* pcyHeight,
IPDDomNode** node,
long* index);

NextFocusNode

Gets the next or previous focusable IPDDomNode. If forward is true, it gets the next focusable node. Returns NULL if there is not another focusable node in the selected direction. Searches only the current DOM tree, which means that in page mode it will only return results within the page tree instead of the entire document.

HRESULT NextFocusNode(
BOOL forward,
IPDDomNode* node);

GetFocusNode

Returns the IPDDomNode with focus. The node is set to NULL if the focus is on the document (rather than an annotation) or if the focus is not within the document.

HRESULT GetFocusNode(
IPDDomNode* node);

SelectText

Sets the text selection by identifying the start and end locations of the selection.

HRESULT SelectText(
IPDDomWord* startNode,
long startIndex,
IPDDomWord* endNode,
long endIndex);

GetTextSelection

Retrieves the value of the selected text.

HRESULT GetTextSelection(
BSTR* selection);

GetSelectionBounds

Not implemented. This procedure always returns S_FALSE.

HRESULT GetSelectionBounds(
IPDDomWord** start,
long* startIndex,
IPDDomWord** stop,
long* stopIndex);

GetDocInfo

Returns the full pathname of the file, how many pages it contains, and the range of pages that are at least partially visible. The status indicates whether there are issues with this document or page, such as access controls prohibiting access or an apparently empty page or document. If lang is not NULL, it is the default language used in the document.

Note

The GetDocInfo and GoToPage methods use different numbering systems. The page numbers returned as firstVisiblePage and lastVisiblePage by GetDocInfo are based on page 1 as the first page of the document. However, the GoToPage method treats page 0 as the first page of the document. Therefore, you must adjust accordingly when passing the value of pageNum to GoToPage.

HRESULT GetDocInfo(
BSTR* fileName,
long* nPages,
long* firstVisiblePage,
long* lastVisiblePage,
long* status,
BSTR* lang);

GoToPage

Positions the document so that the requested page is visible.

Note

The GetDocInfo and GoToPage methods use different numbering systems. The page numbers returned as firstVisiblePage and lastVisiblePage by GetDocInfo are based on page 1 as the first page of the document. However, the GoToPage method treats page 0 as the first page of the document. Therefore, you must adjust accordingly when passing the value of pageNum to GoToPage.

HRESULT GoToPage(
long pageNum);
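
The adjustment that the note calls for is a simple off-by-one conversion. The sketch below shows it together with a range check against nPages. The helper name ToGoToPageIndex is hypothetical, and returning -1 for an out-of-range page is a convention chosen for this sketch, not SDK behavior.

```cpp
#include <cassert>

// Hypothetical helper: converts a 1-based page number, as reported by
// GetDocInfo, to the 0-based page number that GoToPage expects.
// Returns -1 (a convention of this sketch) if the page is out of range.
long ToGoToPageIndex(long docInfoPage, long nPages) {
    if (docInfoPage < 1 || docInfoPage > nPages) {
        return -1;
    }
    return docInfoPage - 1;
}
```

For instance, to jump to the first visible page reported by GetDocInfo, a client would pass ToGoToPageIndex(firstVisiblePage, nPages) to GoToPage after checking that the result is not -1.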

IPDDomElement methods

IPDDomElement defines additional methods that apply only to structure elements.

GetTagName

pszTagName returns the structural element tag for this element.

LRESULT GetTagName (BSTR *pszTagName)

GetStdName

pszStdName returns the standard role for this element. The standard roles are:

Document, Part, Art, Sect, Div, BlockQuote, Caption, TOC, TOCI, Index, NonStruct, Private, Table, TR, TH, TD, L, LI, Lbl, LBody, P, H, H1, H2, H3, H4, H5, H6, Span, Quote, Note, Reference, BibEntry, Code, Link, Figure, Formula, Form

For details, see the PDF Reference.

LRESULT GetStdName (BSTR *pszStdName)

GetID

pszId returns the ID string associated with this element, if it has been defined.

The id value is not the same as the UID returned by IAccID in the MSAA interface; it is an optional attribute of the PDF file itself. For details, see the PDF Reference.

LRESULT GetID (BSTR *pszId)

GetAttribute

pszAttrVal returns the value of the specified attribute for the specified owner for this element. Owner can be NULL or an empty string.

If the element does not have the requested attribute, the method returns S_FALSE.

The set of owners and attributes is open-ended, but the standard structure attributes for Tagged PDF are defined in the PDF Reference. See the table below for accessibility attributes.

LRESULT GetAttribute (BSTR pszAttr, BSTR pszOwner, BSTR *pszAttrVal)
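
One point worth illustrating is that S_FALSE is a success code in COM, so a plain success test does not distinguish "attribute found" from "attribute missing". The sketch below uses local stand-ins for the Windows constants (S_OK is 0 and S_FALSE is 1) so that it compiles without the Windows headers; the names HResult, kS_OK, kS_FALSE, Succeeded, and AttributeWasFound are all chosen for this sketch.

```cpp
#include <cassert>

// Stand-ins for the Windows HRESULT type and constants, declared locally so
// that the sketch compiles without <windows.h>. The numeric values match the
// Windows definitions of S_OK and S_FALSE.
typedef long HResult;
const HResult kS_OK = 0;
const HResult kS_FALSE = 1;

// Mirrors the Windows SUCCEEDED macro: any non-negative value is a success.
bool Succeeded(HResult hr) {
    return hr >= 0;
}

// Hypothetical helper: true only when the call produced an attribute value.
// S_FALSE (attribute not present) and failure codes both yield false.
bool AttributeWasFound(HResult hr) {
    return hr == kS_OK;
}
```

The same comparison against S_OK is useful for other methods in this section that are documented as returning S_FALSE, such as GetCaret.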

Accessibility attributes

Some of the attributes that are useful for assistive technology are listed here. For a complete list, see the PDF Reference.

| Attribute | Owner | Value |
|---|---|---|
| Lang | | ISO language code for text within this element. |
| Alt | | Text containing an equivalent replacement for the content of this element. Automatically incorporated into the value or text content of the element or any of its ancestor elements. |
| ActualText | | Text that is an exact replacement for the content of this element, for example, the text of an illuminated character. Automatically incorporated into the value or text content of the element or any of its ancestor elements. |
| E | | The expanded form of the element’s content, when it is an abbreviation or acronym. |
| RowSpan | Table | Number of rows spanned by the table cell. |
| ColSpan | Table | Number of columns spanned by the table cell. |
| Headers | Table | Array of IDs of Table Header (TH) cells associated with this table cell (TD or TH). |
| Scope | Table | The scope of this table header cell: Row, Column, or Both. |
| Summary | Table | Text that describes the table’s purpose and structure, for use in non-visual rendering such as speech or Braille. |

IPDDomWord methods

IPDDomWord defines additional methods that apply only to individual words in the document.

LastWordOfLine

If this is the last word in a line on the page, isLast returns true. Use this function to determine where the line breaks occur in text. Note that line breaks are inserted into the text content for elements.

LRESULT LastWordOfLine (BOOL *isLast)
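
As an illustration of using this flag to locate line breaks, the sketch below regroups a sequence of words into lines. It works on a hypothetical WordInfo record that a client might fill from GetValue and LastWordOfLine for each word node; neither WordInfo nor GroupIntoLines is part of the SDK.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record a client might build for each word node.
struct WordInfo {
    std::string text;   // the word, as GetValue would return it
    bool lastOfLine;    // the isLast result of LastWordOfLine
};

// Joins words with single spaces and starts a new line after every word
// flagged as the last word of its line. Trailing words without a final
// flag are emitted as one last line.
std::vector<std::string> GroupIntoLines(const std::vector<WordInfo>& words) {
    std::vector<std::string> lines;
    std::string current;
    for (const WordInfo& word : words) {
        if (!current.empty()) {
            current += ' ';
        }
        current += word.text;
        if (word.lastOfLine) {
            lines.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        lines.push_back(current);
    }
    return lines;
}
```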

IPDDomGroupInfo method

IPDDomGroupInfo defines an additional method that applies to radio buttons, list boxes, and combo boxes.

GetGroupPosition

groupSize returns the number of items in the radio button set, the list, or the combo box drop-down list. position returns the 1-based index of the node in that set of values. That is, a value of 1 for position indicates the first value in the set.

LRESULT GetGroupPosition (long *groupSize, long *position)
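
Assistive technology often announces a group position in a form such as "2 of 5". The sketch below formats the two outputs that way, relying on the 1-based position described above. DescribeGroupPosition is a hypothetical helper, and returning an empty string for inconsistent input is a choice made for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: formats the 1-based position and the group size as
// "position of groupSize". Returns an empty string when the values are not
// consistent with a 1-based index into a non-empty group.
std::string DescribeGroupPosition(long groupSize, long position) {
    if (groupSize < 1 || position < 1 || position > groupSize) {
        return "";
    }
    return std::to_string(position) + " of " + std::to_string(groupSize);
}
```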