The samples and documentation provide “Hello World” code that speeds up development with the Extract API. The code examples illustrate how to perform Extract PDF operations such as:
Extract as JSON the content & structure of text, table, and figure elements
Extract as JSON the content, structure & renditions of table and figure elements
Extract as JSON the content, structure & renditions of table and figure elements along with tables as CSVs
Extract as JSON the content, structure & renditions of table and figure elements along with Character Bounding Boxes
The output of an SDK extract operation is a zip package containing the following:
The structuredData.json file with the extracted content & PDF element structure. See the JSON schema.
One or more renditions folders, one for each element type selected as input. Each folder is named either “tables” or “figures”, depending on the specified element type, and contains renditions whose filenames correspond to the element information in the JSON file.
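The package layout described above can be inspected with the Python standard library alone. The following is a minimal sketch, not part of the SDK: the function names, the demo file names, and the top-level `elements` key in the demo JSON are assumptions made for illustration.

```python
import io
import json
import zipfile


def summarize_extract_package(zip_source):
    """Return the parsed structuredData.json and the rendition file names by folder."""
    with zipfile.ZipFile(zip_source) as package:
        # structuredData.json is expected at the root of the package.
        with package.open("structuredData.json") as handle:
            structured = json.load(handle)

        # Group rendition files by their folder ("tables" or "figures").
        renditions = {"tables": [], "figures": []}
        for name in package.namelist():
            folder, _, filename = name.partition("/")
            if folder in renditions and filename:
                renditions[folder].append(filename)
    return structured, renditions


def build_demo_package():
    """Build a tiny in-memory zip that mimics the layout described above."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # The "elements" key and the file names below are illustrative only.
        package.writestr("structuredData.json", json.dumps({"elements": []}))
        package.writestr("tables/fileoutpart0.png", b"")
        package.writestr("figures/fileoutpart1.png", b"")
    buffer.seek(0)
    return buffer


structured, renditions = summarize_extract_package(build_demo_package())
print(sorted(structured), renditions)
```

In practice, `zip_source` would be the path of the zip file saved by the extract operation rather than an in-memory buffer.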
The SDK supports providing authentication credentials at runtime. Doing so allows fetching the credentials from a secret server at runtime instead of storing them in a file. Please refer to the following samples for details:
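As a rough illustration of the runtime approach (not one of the SDK samples), the sketch below reads credentials from environment variables that a secret manager might inject into the process. The function name, the variable names, and the returned dictionary shape are all assumptions; how the values are then handed to the SDK is not shown here.

```python
import os


def load_credentials_from_env(environ=None):
    """Read a client ID and secret from environment variables at runtime."""
    environ = os.environ if environ is None else environ
    # Hypothetical variable names; use whatever your secret store injects.
    required = ("PDF_SERVICES_CLIENT_ID", "PDF_SERVICES_CLIENT_SECRET")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing credential variables: " + ", ".join(missing))
    return {
        "client_id": environ[required[0]],
        "client_secret": environ[required[1]],
    }


# Demo with an explicit mapping so the sketch runs without real secrets.
demo = load_credentials_from_env(
    {"PDF_SERVICES_CLIENT_ID": "demo-id", "PDF_SERVICES_CLIENT_SECRET": "demo-secret"}
)
print(sorted(demo))
```

Failing fast when a variable is missing keeps a misconfigured deployment from reaching the API call with empty credentials.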
The APIs use inferred timeout properties and provide defaults. However, the SDK supports custom timeouts for the API calls. You can tailor the timeout settings for your environment and network speed. In addition to the details below, you can refer to working code samples:
Available properties (Java):
connectTimeout: Default: 4000. The maximum allowed time in milliseconds for creating an initial HTTPS connection.
socketTimeout: Default: 10000. The maximum allowed time in milliseconds between two successive HTTP response packets.
Override the timeout properties via a custom ClientConfig class:
ClientConfig clientConfig = ClientConfig.builder()
    .withConnectTimeout(3000)
    .withSocketTimeout(20000)
    .build();
Available properties (Node.js):
connectTimeout: Default: 10000. The maximum allowed time in milliseconds for creating an initial HTTPS connection.
readTimeout: Default: 20000. The maximum allowed time in milliseconds between two successive HTTP response packets.
Override the timeout properties via a custom ClientConfig class:
const clientConfig = PDFToolsSdk.ClientConfig
    .clientConfigBuilder()
    .withConnectTimeout(15000)
    .withReadTimeout(25000)
    .build();
Available properties (Python):
connectTimeout: Default: 4000. The number of milliseconds Requests will wait for the client to establish a connection to the server.
readTimeout: Default: 10000. The number of milliseconds the client will wait for the server to send a response.
Override the timeout properties via a custom ClientConfig class:
# Parentheses let the chained builder calls span several lines in Python.
client_config = (
    ClientConfig.builder()
    .with_connect_timeout(10000)
    .with_read_timeout(40000)
    .build()
)
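The Python property descriptions mention Requests, whose `timeout` argument accepts a `(connect, read)` tuple expressed in seconds. The sketch below shows how the millisecond values above would map onto that form. The helper name is hypothetical, and whether the SDK performs this exact conversion internally is an assumption rather than something this page confirms.

```python
def to_requests_timeout(connect_timeout_ms=4000, read_timeout_ms=10000):
    """Convert millisecond timeouts into a (connect, read) tuple in seconds."""
    for label, value in (("connect", connect_timeout_ms), ("read", read_timeout_ms)):
        if value <= 0:
            raise ValueError(f"{label} timeout must be positive, got {value}")
    return (connect_timeout_ms / 1000, read_timeout_ms / 1000)


print(to_requests_timeout())              # the Python defaults listed above
print(to_requests_timeout(10000, 40000))  # the Python override values
```

Keeping the two values separate matters because a slow network mostly affects connection setup, while large documents mostly affect how long the server takes to respond.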
Use the sample below to extract text element information from a PDF document.
The sample below extracts text, table, and figure element information from a PDF document.
The sample below extracts text, table, and figure element information as well as table renditions from a PDF document. Note that the output is a zip containing the structured information along with the renditions.
The sample below extracts table renditions and bounding boxes for characters present in text blocks (paragraphs, lists, headings), in addition to text, table, and figure element information from a PDF document. Note that the output is a zip containing the structured information along with the renditions.
The sample below adds the option to get CSV output for tables, in addition to extracting text, table, and figure element information as well as table renditions from a PDF document. Note that the output is a zip containing the structured information along with the renditions.
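Once tables have been requested as CSV, the files can be read back from the output zip with the standard library. This is a minimal sketch under two assumptions: that the CSV files sit in the “tables” folder of the package, and that they are UTF-8 encoded. The function and file names are hypothetical.

```python
import csv
import io
import zipfile


def read_table_csvs(zip_source):
    """Return {file name: rows} for every CSV found under tables/ in the package."""
    tables = {}
    with zipfile.ZipFile(zip_source) as package:
        for name in package.namelist():
            if name.startswith("tables/") and name.lower().endswith(".csv"):
                # utf-8-sig decodes UTF-8 and strips a byte-order mark if present.
                text = package.read(name).decode("utf-8-sig")
                tables[name] = list(csv.reader(io.StringIO(text)))
    return tables


def build_demo_package():
    """Build a small in-memory zip with one CSV table and one PNG rendition."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("structuredData.json", "{}")
        package.writestr("tables/fileoutpart0.csv", "Item,Qty\r\nWidget,2\r\n")
        package.writestr("tables/fileoutpart1.png", b"")
    buffer.seek(0)
    return buffer


for name, rows in read_table_csvs(build_demo_package()).items():
    print(name, rows)
```

Non-CSV renditions in the same folder are skipped, so the function works whether or not image renditions of the tables were also requested.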