Overview
Extract PDF
ExtractPDFOperation
- class adobe.pdfservices.operation.pdfops.extract_pdf_operation.ExtractPDFOperation(create_key)
Bases:
adobe.pdfservices.operation.operation.Operation
An Operation that extracts pdf elements such as text and tables in a structured format from a PDF, along with renditions for tables and figures.
Sample usage.
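import logging
import os

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import ExtractRenditionsElementType

try:
    # Locate the credentials file and build the execution context.
    base_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(base_path + "/pdfservices-api-credentials.json") \
        .build()
    execution_context = ExecutionContext.create(credentials)

    # Create the operation and set its input file.
    extract_pdf_operation = ExtractPDFOperation.create_new()
    source = FileRef.create_from_local_file(base_path + "/resources/extractPdfInput.pdf")
    extract_pdf_operation.set_input(source)

    # Choose what to extract and which renditions to generate.
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
        .with_elements_to_extract_renditions([ExtractRenditionsElementType.TABLES, ExtractRenditionsElementType.FIGURES]) \
        .with_get_char_info(True) \
        .with_include_styling_info(True) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)

    # Run the operation and save the resulting Zip file.
    result: FileRef = extract_pdf_operation.execute(execution_context)
    result.save_as(base_path + "/output/ExtractTextTableWithFigureTableRendition.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")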
- SUPPORTED_SOURCE_MEDIA_TYPES = {adobe.pdfservices.operation.internal.extension_media_type_mapping.ExtensionMediaTypeMapping.PDF.mime_type}
The supported source file format for ExtractPDFOperation is .pdf.
- classmethod create_new()
creates a new instance of ExtractPDFOperation.
- Returns
A new instance of ExtractPDFOperation
- Return type
- execute(execution_context: adobe.pdfservices.operation.execution_context.ExecutionContext)
Executes this operation synchronously using the supplied context and returns a new FileRef instance for the resulting Zip file. The resulting file may be stored in the system temporary directory. See adobe.pdfservices.operation.io.file_ref.FileRef for how temporary resources are cleaned up.
- Parameters
execution_context (ExecutionContext) – The context in which the operation will be executed.
- Returns
The FileRef to the result.
- Return type
- Raises
ServiceApiException – if an API call results in an error response.
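Because execute() returns a FileRef to a Zip file, a typical next step is to open the saved archive. The sketch below uses only the standard library and a small stand-in archive built in memory, so it runs without calling the service. The member name structuredData.json, the tables/ entry, and the "elements" key are assumptions about the archive layout that this page does not state; check them against a real result.

```python
import io
import json
import zipfile


def list_members(zip_bytes: bytes) -> list:
    """Return the sorted names of all entries in a Zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(archive.namelist())


def read_json_member(zip_bytes: bytes, member: str) -> dict:
    """Return the parsed JSON document stored under `member` in a Zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            return json.load(handle)


# Build a stand-in archive so the example runs without the SDK.
# "structuredData.json" and the "elements" key are assumed names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("structuredData.json", json.dumps({"elements": [{"Text": "Hello"}]}))
    archive.writestr("tables/fileoutpart0.csv", "a,b\n1,2\n")
sample_zip = buffer.getvalue()

members = list_members(sample_zip)
data = read_json_member(sample_zip, "structuredData.json")
print(members)
print(len(data["elements"]))
```

With a real result, the bytes would come from the file written by save_as() rather than from the in-memory buffer.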
- get_options()
gets the ExtractPDFOptions.
- Returns
The options parameter of the operation
- Return type
- set_input(source_file_ref: adobe.pdfservices.operation.io.file_ref.FileRef)
Sets an input file.
- Parameters
source_file_ref (FileRef) – An input file.
- Returns
This instance to add any additional parameters.
- Return type
- set_options(extract_pdf_options: adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options.ExtractPDFOptions)
sets the ExtractPDFOptions.
- Parameters
extract_pdf_options (ExtractPDFOptions) – ExtractPDFOptions to set.
- Returns
This instance to add any additional parameters.
- Return type
ExtractPDFOptions
- class adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options.ExtractPDFOptions(elements_to_extract, elements_to_extract_renditions, get_char_info, table_output_format, include_styling_info=None)
Bases:
object
An Options Class that defines the options for ExtractPDFOperation.
extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
    .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
    .with_get_char_info(True) \
    .with_table_structure_format(TableStructureType.CSV) \
    .with_elements_to_extract_renditions([ExtractRenditionsElementType.FIGURES, ExtractRenditionsElementType.TABLES]) \
    .with_include_styling_info(True) \
    .build()
- class Builder
Bases:
object
The builder for ExtractPDFOptions.
- build()
- with_element_to_extract(element_to_extract: adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type.ExtractElementType)
adds a pdf element type for extracting structured information.
- Parameters
element_to_extract (ExtractElementType) – ExtractElementType to be extracted
- Returns
This Builder instance to add any additional parameters.
- Return type
- Raises
ValueError – if element_to_extract is None.
- with_element_to_extract_renditions(element_to_extract_renditions: adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type.ExtractRenditionsElementType)
adds a pdf element type for extracting rendition.
- Parameters
element_to_extract_renditions (ExtractRenditionsElementType) – ExtractRenditionsElementType whose renditions have to be extracted
- Returns
This Builder instance to add any additional parameters.
- Return type
- Raises
ValueError – if element_to_extract_renditions is None.
- with_elements_to_extract(elements_to_extract: List[adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type.ExtractElementType])
adds a list of pdf element types for extracting structured information.
- Parameters
elements_to_extract (List[ExtractElementType]) – List of ExtractElementType to be extracted
- Returns
This Builder instance to add any additional parameters.
- Return type
- Raises
ValueError – if elements_to_extract is None or empty list.
- with_elements_to_extract_renditions(elements_to_extract_renditions: List[adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type.ExtractRenditionsElementType])
adds a list of pdf element types for extracting rendition.
- Parameters
elements_to_extract_renditions (List[ExtractRenditionsElementType]) – List of ExtractRenditionsElementType whose renditions have to be extracted
- Returns
This Builder instance to add any additional parameters.
- Return type
- Raises
ValueError – if elements_to_extract_renditions is None or empty list.
- with_get_char_info(get_char_info: bool)
sets the Boolean specifying whether to add character level bounding boxes to output json
- Parameters
get_char_info (bool) – Set True to extract character level bounding boxes information
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_include_styling_info(include_styling_info: bool)
sets the Boolean specifying whether to add PDF Elements Styling Info to output json
- Parameters
include_styling_info (bool) – Set True to extract PDF Elements Styling Info
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_table_structure_format(table_structure: adobe.pdfservices.operation.pdfops.options.extractpdf.table_structure_type.TableStructureType)
adds the table structure format (currently csv only) for extracting structured information.
- Parameters
table_structure (TableStructureType) – TableStructureType to be extracted
- Returns
This Builder instance to add any additional parameters.
- Return type
- Raises
ValueError – if table_structure is None.
- static builder()
Returns a Builder for
ExtractPDFOptions
- Returns
The builder class for ExtractPDFOptions
- Return type
- property elements_to_extract
List of pdf element types to be extracted in a structured format from input file
- property elements_to_extract_renditions
List of pdf element types whose renditions need to be extracted from input file
- property get_char_info
Boolean specifying whether to add character level bounding boxes to output json
- property include_styling_info
Boolean specifying whether to add PDF Elements Styling Info to output json
- property table_output_format
Format in which tables are exported; currently only csv is supported
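The Raises entries of the builder methods describe a simple contract: None is rejected with a ValueError, and the list variants also reject an empty list. The stand-in below shows that contract and the chaining style in isolation. OptionsSketch and its Builder are hypothetical classes written for this illustration; they are not the SDK's implementation, and plain strings take the place of the element type enums.

```python
from typing import List, Optional


class OptionsSketch:
    """Hypothetical stand-in for an options object; not the SDK class."""

    def __init__(self, elements: List[str], get_char_info: Optional[bool]):
        self.elements = elements
        self.get_char_info = get_char_info

    class Builder:
        def __init__(self):
            self._elements: List[str] = []
            self._get_char_info: Optional[bool] = None

        def with_element(self, element: Optional[str]) -> "OptionsSketch.Builder":
            # Mirrors the documented rule: None is rejected with ValueError.
            if element is None:
                raise ValueError("element must not be None")
            self._elements.append(element)
            return self  # returning self is what makes chaining possible

        def with_elements(self, elements: Optional[List[str]]) -> "OptionsSketch.Builder":
            # Mirrors the documented rule: None or an empty list is rejected.
            if not elements:
                raise ValueError("elements must be a non-empty list")
            self._elements.extend(elements)
            return self

        def with_get_char_info(self, flag: bool) -> "OptionsSketch.Builder":
            self._get_char_info = flag
            return self

        def build(self) -> "OptionsSketch":
            return OptionsSketch(list(self._elements), self._get_char_info)

    @staticmethod
    def builder() -> "OptionsSketch.Builder":
        return OptionsSketch.Builder()


options = OptionsSketch.builder() \
    .with_elements(["TEXT", "TABLES"]) \
    .with_get_char_info(True) \
    .build()

# An empty list is refused before any options object is built.
try:
    OptionsSketch.builder().with_elements([])
    rejected = False
except ValueError:
    rejected = True
```

Validating inside each with_* call, rather than in build(), means the error surfaces on the line that supplied the bad value.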
Autotag PDF
AutotagPDFOperation
- class adobe.pdfservices.operation.pdfops.autotag_pdf_operation.AutotagPDFOperation(create_key)
Bases:
adobe.pdfservices.operation.operation.Operation
An operation that enables clients to improve accessibility of the PDF document. It generates the tagged PDF, along with an optional XLSX report providing detailed information about the added tags. The operation replaces any existing tags within the input document, so it provides the most benefit for PDFs that have no tags or low-quality tags.
Sample usage.
try:
    base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(base_path + "/pdfservices-api-credentials.json") \
        .build()
    execution_context = ExecutionContext.create(credentials)
    autotag_pdf_operation = AutotagPDFOperation.create_new()
    input_file_path = 'autotagPdfInput.pdf'
    source = FileRef.create_from_local_file(base_path + "/resources/" + input_file_path)
    autotag_pdf_operation.set_input(source)
    autotag_pdf_options: AutotagPDFOptions = AutotagPDFOptions.builder() \
        .with_shift_headings() \
        .with_generate_report() \
        .build()
    autotag_pdf_operation.set_options(autotag_pdf_options)
    autotag_output_files: AutotagPDFOutputFiles = autotag_pdf_operation.execute(execution_context)
    input_file_name = Path(input_file_path).stem
    base_output_path = base_path + "/output/AutotagPDFWithOptions/"
    Path(base_output_path).mkdir(parents=True, exist_ok=True)
    tagged_pdf_path = f'{base_output_path}{input_file_name}-tagged.pdf'
    report_path = f'{base_output_path}{input_file_name}-report.xlsx'
    autotag_output_files.save_pdf_file(tagged_pdf_path)
    autotag_output_files.save_xls_file(report_path)
except (ServiceApiException, ServiceUsageException, SdkException) as e:
    logging.exception(f'Exception encountered while executing operation: {e}')
- SUPPORTED_SOURCE_MEDIA_TYPES = {adobe.pdfservices.operation.internal.extension_media_type_mapping.ExtensionMediaTypeMapping.PDF.mime_type}
The supported source file format for AutotagPDFOperation is .pdf.
- classmethod create_new()
creates a new instance of AutotagPDFOperation.
- Returns
A new instance of AutotagPDFOperation
- Return type
- execute(execution_context: adobe.pdfservices.operation.execution_context.ExecutionContext)
Executes this operation synchronously using the supplied context and returns a new AutotagPDFOutputFiles instance for the generated tagged pdf file and XLSX report file. The resulting file may be stored in the system temporary directory. See adobe.pdfservices.operation.io.file_ref.FileRef for how temporary resources are cleaned up.
- Parameters
execution_context (ExecutionContext) – The context in which the operation will be executed.
- Returns
The instance of AutotagPDFOutputFiles.
- Return type
AutotagPDFOutputFiles
- Raises
ServiceApiException – if an API call results in an error response.
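The sample usage derives both output file names from the stem of the input file before calling save_pdf_file() and save_xls_file(). Isolated from the SDK calls, that naming step can be sketched as follows. autotag_output_paths is a hypothetical helper, and the directory name is simply the one used in the sample.

```python
from pathlib import Path
from typing import Tuple


def autotag_output_paths(input_file_path: str, output_dir: str) -> Tuple[Path, Path]:
    """Return (tagged PDF path, XLSX report path) named after the input file."""
    stem = Path(input_file_path).stem  # file name without directory or extension
    base = Path(output_dir)
    return base / f"{stem}-tagged.pdf", base / f"{stem}-report.xlsx"


tagged_pdf_path, report_path = autotag_output_paths(
    "autotagPdfInput.pdf", "output/AutotagPDFWithOptions"
)
print(tagged_pdf_path.name)  # autotagPdfInput-tagged.pdf
print(report_path.name)      # autotagPdfInput-report.xlsx
```

As in the sample, the output directory still has to be created (for example with Path.mkdir(parents=True, exist_ok=True)) before the files are saved.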
- get_options()
gets the AutotagPDFOptions.
- Returns
The options parameter of the operation
- Return type
- set_input(source_file_ref: adobe.pdfservices.operation.io.file_ref.FileRef)
Sets an input file.
- Parameters
source_file_ref (FileRef) – An input file.
- Returns
This instance to add any additional parameters.
- Return type
- set_options(autotag_pdf_options: adobe.pdfservices.operation.pdfops.options.autotagpdf.autotag_pdf_options.AutotagPDFOptions)
sets the AutotagPDFOptions.
- Parameters
autotag_pdf_options (AutotagPDFOptions) – AutotagPDFOptions to set.
- Returns
This instance to add any additional parameters.
- Return type
AutotagPDFOptions
- class adobe.pdfservices.operation.pdfops.options.autotagpdf.autotag_pdf_options.AutotagPDFOptions(generate_report, shift_headings)
Bases:
object
An Options Class that defines the options for AutotagPDFOperation.
autotag_pdf_options: AutotagPDFOptions = AutotagPDFOptions.builder() \
    .with_generate_report() \
    .with_shift_headings() \
    .build()
- class Builder
Bases:
object
The builder for AutotagPDFOptions.
- build()
builds and returns the AutotagPDFOptions instance
- with_generate_report()
sets the Boolean specifying whether to generate an XLSX report containing the information about the tags
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_shift_headings()
sets the Boolean specifying whether to shift headings in the output PDF file
- Returns
This Builder instance to add any additional parameters.
- Return type
- static builder()
Returns a Builder for
AutotagPDFOptions
- Returns
The builder class for AutotagPDFOptions
- Return type
- property generate_report
Boolean specifying whether to generate an XLSX report as an output
- property shift_headings
Boolean specifying whether to shift the headings in the output PDF file
Config
ClientConfig
- class adobe.pdfservices.operation.client_config.ClientConfig
Bases:
object
Encapsulates the API request configurations
- class Builder
Bases:
object
Builds a ClientConfig instance.
- build()
Returns a new ClientConfig instance built from the current state of this builder.
- Returns
A ClientConfig instance.
- Return type
- from_file(client_config_file_path: str)
Sets the connect timeout and read timeout using the JSON client config file path. All the keys in the JSON structure are optional.
- Parameters
client_config_file_path (str) – JSON client config file path
- Returns
This Builder instance to add any additional parameters.
- Return type
JSON structure:
{ "connectTimeout": "4000", "readTimeout": "20000" }
- with_connect_timeout(connect_timeout: int)
Sets the connect timeout. It should be greater than zero.
- Parameters
connect_timeout (int) – determines the timeout in milliseconds until a connection is established in the API calls. Default value is 4000 milliseconds
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_read_timeout(read_timeout: int)
Sets the read timeout. It should be greater than zero.
- Parameters
read_timeout (int) – Defines the read timeout in milliseconds, that is, the number of milliseconds the client will wait for the server to send a response after the connection is established. Default value is 10000 milliseconds
- Returns
This Builder instance to add any additional parameters.
- Return type
- static builder()
Creates a new ClientConfig builder.
- Returns
A ClientConfig.Builder instance.
- Return type
ClientConfigBuilder
- class adobe.pdfservices.operation.client_config.ClientConfig.Builder
Bases:
object
Builds a ClientConfig instance.
- build()
Returns a new ClientConfig instance built from the current state of this builder.
- Returns
A ClientConfig instance.
- Return type
- from_file(client_config_file_path: str)
Sets the connect timeout and read timeout using the JSON client config file path. All the keys in the JSON structure are optional.
- Parameters
client_config_file_path (str) – JSON client config file path
- Returns
This Builder instance to add any additional parameters.
- Return type
JSON structure:
{ "connectTimeout": "4000", "readTimeout": "20000" }
- with_connect_timeout(connect_timeout: int)
Sets the connect timeout. It should be greater than zero.
- Parameters
connect_timeout (int) – determines the timeout in milliseconds until a connection is established in the API calls. Default value is 4000 milliseconds
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_read_timeout(read_timeout: int)
Sets the read timeout. It should be greater than zero.
- Parameters
read_timeout (int) – Defines the read timeout in milliseconds, that is, the number of milliseconds the client will wait for the server to send a response after the connection is established. Default value is 10000 milliseconds
- Returns
This Builder instance to add any additional parameters.
- Return type
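As a sketch of the file that from_file() reads, the helper below writes a JSON document with the two optional keys from the documented structure and rejects non-positive timeouts, matching the "greater than zero" requirement of the with_* methods. write_client_config is a hypothetical helper, not part of the SDK; the values are written as strings only because the documented JSON structure shows them that way.

```python
import json
import os
import tempfile
from typing import Optional


def write_client_config(path: str,
                        connect_timeout_ms: Optional[int] = None,
                        read_timeout_ms: Optional[int] = None) -> dict:
    """Write a client config JSON file; both keys are optional."""
    config = {}
    for key, value in (("connectTimeout", connect_timeout_ms),
                       ("readTimeout", read_timeout_ms)):
        if value is None:
            continue  # omitted keys are allowed by the documented structure
        if value <= 0:
            raise ValueError(f"{key} must be greater than zero")
        config[key] = str(value)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config


config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "client_config.json")
written = write_client_config(config_path, connect_timeout_ms=4000, read_timeout_ms=20000)

# Read the file back to confirm what a consumer of the path would see.
with open(config_path, encoding="utf-8") as handle:
    loaded = json.load(handle)
print(loaded)
```

The resulting path could then be handed to ClientConfig.builder().from_file(...), and the built ClientConfig passed as the optional client_config argument of ExecutionContext.create().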
ExecutionContext
- class adobe.pdfservices.operation.execution_context.ExecutionContext
Bases:
object
Represents the execution context of an Operation. An execution context typically consists of the desired authentication credentials and client configurations such as timeouts.
For each set of credentials, an ExecutionContext instance can be reused across operations.
Sample Usage:
try:
    base_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(base_path + "/pdfservices-api-credentials.json") \
        .build()
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()
    source = FileRef.create_from_local_file(base_path + "/resources/extractPdfInput.pdf")
    extract_pdf_operation.set_input(source)
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
        .with_elements_to_extract_renditions([ExtractRenditionsElementType.TABLES, ExtractRenditionsElementType.FIGURES]) \
        .with_get_char_info(True) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)
    result: FileRef = extract_pdf_operation.execute(execution_context)
    result.save_as(base_path + "/output/ExtractTextTableWithFigureTableRendition.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")
- static create(credentials: adobe.pdfservices.operation.auth.credentials.Credentials, client_config: Optional[adobe.pdfservices.operation.client_config.ClientConfig] = None)
Creates a context instance using the provided Credentials and ClientConfig
- Parameters
credentials (Credentials) – A Credentials instance
client_config (ClientConfig, optional) – A ClientConfig instance for providing custom http timeouts, defaults to None
- Returns
A new ExecutionContext instance
- Return type
Auth
Credentials
- class adobe.pdfservices.operation.auth.credentials.Credentials
Bases:
abc.ABC
Marker base class for different types of credentials. Currently it supports only ServiceAccountCredentials. The factory methods within this class can be used to create instances of credentials classes.
- static service_account_credentials_builder()
Creates a new ServiceAccountCredentials builder.
- Returns
An instance of ServiceAccountCredentials Builder.
- Return type
ServiceAccountCredentials
- class adobe.pdfservices.operation.auth.service_account_credentials.ServiceAccountCredentials(client_id, client_secret, private_key, organization_id, account_id, ims_base_uri=adobe.pdfservices.operation.internal.constants.service_constants.ServiceConstants.JWT_BASE_URI, claim=None)
Bases:
adobe.pdfservices.operation.auth.credentials.Credentials, abc.ABC
Service Account credentials allow your application to call PDF Tools Extract API on behalf of the application itself, or on behalf of an enterprise organization. For getting the credentials, Click Here.
- class Builder
Bases:
object
Builds a ServiceAccountCredentials instance.
- build()
Returns a new ServiceAccountCredentials instance built from the current state of this builder.
- Returns
A ServiceAccountCredentials instance.
- Return type
- from_file(credentials_file_path: str)
Sets Service Account Credentials using the JSON credentials file path. All the keys in the JSON structure are optional.
JSON structure:
{ "client_credentials": { "client_id": "CLIENT_ID", "client_secret": "CLIENT_SECRET" }, "service_account_credentials": { "organization_id": "org_ident@AdobeOrg", "account_id": "id@techacct.adobe.com", "private_key_file": "private.key" } }
private_key_file is the path of the private key file. It will be looked up in the classpath and the directory of the JSON credentials file.
- Parameters
credentials_file_path (str) – JSON credentials file path
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_account_id(account_id: str)
Set Account Id (format: id@techacct.adobe.com)
- Parameters
account_id (str) – Account ID (format: id@techacct.adobe.com)
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_client_id(client_id: str)
Set Client ID (API Key)
- Parameters
client_id (str) – Client Id (API Key)
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_client_secret(client_secret: str)
Set Client Secret
- Parameters
client_secret (str) – Client Secret
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_organization_id(organization_id: str)
Set Organization Id (format: org_ident@AdobeOrg) that has been configured for access to PDF Tools API
- Parameters
organization_id (str) – Organization ID (format: org_ident@AdobeOrg)
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_private_key(private_key: str)
Set private key
- Parameters
private_key (str) – Content of the Private Key (PEM format)
- Returns
This Builder instance to add any additional parameters.
- Return type
- property account_id
Account ID(format: id@techacct.adobe.com)
- property claim
Identifies the Service for which Authorization(Access) Token will be issued
- property client_id
Client Id (API Key)
- property client_secret
Client Secret
- property organization_id
Identifies the organization(format: org_ident@AdobeOrg) that has been configured for access to PDF Tools API.
- property private_key
Content of the Private Key (PEM format)
ServiceAccountCredentialsBuilder
- class adobe.pdfservices.operation.auth.service_account_credentials.ServiceAccountCredentials.Builder
Bases:
object
Builds a ServiceAccountCredentials instance.
- build()
Returns a new ServiceAccountCredentials instance built from the current state of this builder.
- Returns
A ServiceAccountCredentials instance.
- Return type
- from_file(credentials_file_path: str)
Sets Service Account Credentials using the JSON credentials file path. All the keys in the JSON structure are optional.
JSON structure:
{ "client_credentials": { "client_id": "CLIENT_ID", "client_secret": "CLIENT_SECRET" }, "service_account_credentials": { "organization_id": "org_ident@AdobeOrg", "account_id": "id@techacct.adobe.com", "private_key_file": "private.key" } }
private_key_file is the path of the private key file. It will be looked up in the classpath and the directory of the JSON credentials file.
- Parameters
credentials_file_path (str) – JSON credentials file path
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_account_id(account_id: str)
Set Account Id (format: id@techacct.adobe.com)
- Parameters
account_id (str) – Account ID (format: id@techacct.adobe.com)
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_client_id(client_id: str)
Set Client ID (API Key)
- Parameters
client_id (str) – Client Id (API Key)
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_client_secret(client_secret: str)
Set Client Secret
- Parameters
client_secret (str) – Client Secret
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_organization_id(organization_id: str)
Set Organization Id (format: org_ident@AdobeOrg) that has been configured for access to PDF Tools API
- Parameters
organization_id (str) – Organization ID (format: org_ident@AdobeOrg)
- Returns
This Builder instance to add any additional parameters.
- Return type
- with_private_key(private_key: str)
Set private key
- Parameters
private_key (str) – Content of the Private Key (PEM format)
- Returns
This Builder instance to add any additional parameters.
- Return type
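The lookup rule for private_key_file can be illustrated without the SDK. This sketch parses a credentials file with the documented JSON structure and resolves a relative private_key_file against the directory that holds the JSON file. It covers only that one lookup location and may not match the SDK's behavior in every case; resolve_private_key_path is a hypothetical helper, and the key content is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def resolve_private_key_path(credentials_file_path: str) -> Path:
    """Resolve private_key_file relative to the credentials JSON's directory."""
    credentials_path = Path(credentials_file_path)
    with credentials_path.open(encoding="utf-8") as handle:
        document = json.load(handle)
    key_file = Path(document["service_account_credentials"]["private_key_file"])
    if key_file.is_absolute():
        return key_file
    return credentials_path.parent / key_file


# Create placeholder files so the example runs end to end.
work_dir = Path(tempfile.mkdtemp())
(work_dir / "private.key").write_text("PLACEHOLDER KEY CONTENT", encoding="utf-8")
credentials_file = work_dir / "pdfservices-api-credentials.json"
credentials_file.write_text(json.dumps({
    "client_credentials": {
        "client_id": "CLIENT_ID",
        "client_secret": "CLIENT_SECRET",
    },
    "service_account_credentials": {
        "organization_id": "org_ident@AdobeOrg",
        "account_id": "id@techacct.adobe.com",
        "private_key_file": "private.key",
    },
}), encoding="utf-8")

key_path = resolve_private_key_path(str(credentials_file))
key_exists = key_path.exists()
print(key_path.name, key_exists)
```

A check like this before calling from_file() can turn a missing key file into an early, readable error.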
IO
FileRef
- class adobe.pdfservices.operation.io.file_ref.FileRef
Bases:
abc.ABC
This class represents a local file. It is typically used by an SDK Operation which accepts or returns files.
When a FileRef instance is created by this SDK while referring to a temporary file location, calling any of the methods that save the FileRef (for example, save_as()) will delete the temporary file.
- static create_from_local_file(local_source: str, media_type: Optional[str] = None)
Creates a FileRef instance from a local file path. If no media type is provided, it will be inferred from the file extension.
- Parameters
local_source (str) – Local file path, either absolute path or relative to the working directory.
media_type (str, optional, defaults to None) – Media type to identify the local file format, defaults to None
- Returns
A FileRef instance.
- Return type
- static create_from_stream(input_stream: _io.BufferedReader, media_type: str)
Creates a FileRef instance from a readable stream using the specified media type. The stream is not read by this method but by consumers of the file content, i.e. the execute method of an operation, such as execute().
- Parameters
input_stream (BufferedReader) – Readable Stream representing the file.
media_type (str) – Media type to identify the file format.
- Returns
A FileRef instance.
- Return type
- get_media_type()
returns the media type
- abstract save_as(local_file_path: str)
- abstract write_to_stream(writer_stream)
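Two details of the factory methods can be checked with the standard library alone: what a media type inferred from a file extension looks like, and what kind of object satisfies the BufferedReader parameter of create_from_stream(). The sketch uses mimetypes as a stand-in for the inference; the SDK has its own extension mapping, which may not agree with mimetypes for every extension.

```python
import io
import mimetypes
import os
import tempfile

# Infer a media type from the extension, as create_from_local_file() is
# documented to do when media_type is omitted (stand-in logic, not the SDK's).
media_type, _encoding = mimetypes.guess_type("extractPdfInput.pdf")
print(media_type)

# A file opened in binary read mode is an io.BufferedReader, the type named
# in the create_from_stream() signature.
fd, path = tempfile.mkstemp(suffix=".pdf")
os.close(fd)
with open(path, "rb") as stream:
    is_buffered_reader = isinstance(stream, io.BufferedReader)
os.remove(path)
print(is_buffered_reader)
```

Because create_from_stream() does not read the stream itself, the stream presumably has to stay open until the operation that consumes it has executed.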
Exceptions
Exceptions
- exception adobe.pdfservices.operation.exception.exceptions.SdkException(message, request_tracking_id=None)
Bases:
Exception
SdkException is typically thrown for client-side or network errors.
- property request_tracking_id
The request tracking id of the exception.
- exception adobe.pdfservices.operation.exception.exceptions.ServiceApiException(message, request_tracking_id, status_code=0, error_code='UNKNOWN')
Bases:
Exception
ServiceApiException is thrown when an underlying service API call results in an error.
- DEFAULT_ERROR_CODE = 'UNKNOWN'
The default value of error code if there is no error code for this service exception.
- DEFAULT_STATUS_CODE = 0
The default value of status code if there is no status code for this service exception.
- property error_code
Returns the detailed message of this error.
- property request_tracking_id
The request tracking id of the exception.
- property status_code
Returns the HTTP Status code or DEFAULT_STATUS_CODE if the status code doesn’t adequately represent the error.
- exception adobe.pdfservices.operation.exception.exceptions.ServiceUsageException(message, request_tracking_id, status_code=429, error_code='UNKNOWN')
Bases:
Exception
ServiceUsageException is thrown when either the service usage limit has been reached or the credentials quota has been exhausted.
- DEFAULT_ERROR_CODE = 'UNKNOWN'
The default value of error code if there is no error code for this service failure.
- DEFAULT_STATUS_CODE = 429
The default value of status code if there is no status code for this service failure.
- property error_code
Returns the detailed message of this error.
- property request_tracking_id
The request tracking id of the exception.
- property status_code
Returns the HTTP Status code or DEFAULT_STATUS_CODE if the status code doesn’t adequately represent the error.
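One way to act on these attributes is to retry only when the failure is a usage limit (status code 429) and to surface everything else. The sketch below uses a hypothetical stand-in exception class so that it runs without the SDK; only the attribute names and the default values are taken from this page, and the retry policy itself is an assumption, not SDK behavior.

```python
import time


class UsageLimitSketch(Exception):
    """Stand-in carrying the documented attributes of ServiceUsageException."""

    def __init__(self, message, request_tracking_id, status_code=429, error_code="UNKNOWN"):
        super().__init__(message)
        self.request_tracking_id = request_tracking_id
        self.status_code = status_code
        self.error_code = error_code


def run_with_retry(operation, max_attempts=3, base_delay_s=0.0):
    """Call operation(), retrying with exponential backoff on status code 429."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except UsageLimitSketch as error:
            # Give up on the last attempt or on any non-429 status.
            if error.status_code != 429 or attempt == max_attempts:
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_operation():
    """Fail twice with a usage-limit error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise UsageLimitSketch("quota exceeded", request_tracking_id="demo-id")
    return "done"


result = run_with_retry(flaky_operation)
print(result, calls["count"])
```

With the real SDK, the except clause would name ServiceUsageException, and request_tracking_id is worth logging before re-raising so the failed request can be traced.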