PDF Extract API Quickstart (BETA)

Adobe provides a PDF Extract API that parses structural elements such as text, titles, tables, and figures from any PDF, including scanned documents. The API returns these elements as JSON output. The PDF Extract SDK is currently available in Java and Node.js, but other languages are in the pipeline (e.g., Python).

The PDF Extract API provides a method for developers to extract content and process this extracted content in their custom apps and workflows. For example, developers can:

  • Extract text with structure (headings, paragraphs, and lists)

  • Preserve the PDF’s original reading order structure in the JSON output so that they can more easily find and process content based on the original source

  • Detect tables and extract table cell data

  • Extract tables as images. The images can be used to validate the extracted table data, and the developer doesn’t need to process the output to identify only tables.

  • Extract tables as CSVs.

  • Extract bounding boxes for characters in text blocks (paragraphs, lists, headings) to the output JSON.

Developers can leverage extracted PDF content and structure both in their current workflow as well as in their downstream applications.
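
As a rough illustration of downstream processing, the sketch below groups extracted text by element type. It assumes the JSON output has a top-level `elements` array whose items carry `Path` and `Text` keys; that layout and the sample data are assumptions made for illustration and are not specified in this quickstart, so check the API documentation for the actual schema.

```python
import json
from collections import defaultdict

# Hypothetical sample of the extraction output; the real schema may differ.
SAMPLE_OUTPUT = """
{
  "elements": [
    {"Path": "//Document/H1", "Text": "Quarterly Report"},
    {"Path": "//Document/P", "Text": "Revenue grew in Q3."},
    {"Path": "//Document/L/LI/LBody", "Text": "First item"},
    {"Path": "//Document/Table"}
  ]
}
"""


def group_text_by_type(output: dict) -> dict:
    """Group element text by the last segment of its (assumed) Path key."""
    groups = defaultdict(list)
    for element in output.get("elements", []):
        text = element.get("Text")
        if text is None:
            continue  # skip elements that carry no inline text in this sample
        kind = element.get("Path", "").rsplit("/", 1)[-1]
        groups[kind].append(text)
    return dict(groups)


if __name__ == "__main__":
    grouped = group_text_by_type(json.loads(SAMPLE_OUTPUT))
    for kind, texts in grouped.items():
        print(kind, texts)
```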

To get started, request a developer credential via the request access form, and then download and set up the sample project for your preferred language. The sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.

Beta program access

The PDF Extract API is initially offered through a limited beta program. Once you submit the request access form, the product team provides selected developers with credentials and links to download the SDK. After downloading the SDK and setting up sample projects, you can use this documentation to easily integrate the PDF Extract API into your own server-side code.

While beta program participation is limited, the program will expand as the SDK evolves.

API limitations

The beta program has the following limitations.

  • API rate limit: Beta program users are entitled to 1000 transactions for PDF extraction. A PDF Transaction is based on the initial endpoint request (i.e., API call) and the document output.

  • Unsupported PDF types: The API does not support extracting from digitally signed, encrypted, or policy protected PDFs.

  • Size limits: Maximum supported file size is 100MB.

  • Page limits: Non-scanned PDFs are limited to 200 pages with no more than 50 pages containing tables. Scanned PDFs must be 20 pages or fewer.
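
A client-side pre-flight check can catch files that exceed these limits before a transaction is spent. The sketch below is one possible approach, not part of the SDK: the function name is hypothetical, the caller is assumed to obtain the page counts from a PDF library of their choice, and treating 100MB as 100 × 1024 × 1024 bytes is an assumption.

```python
# Limits taken from the list above; the byte interpretation of "100MB" is an assumption.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024
MAX_PAGES_NON_SCANNED = 200
MAX_PAGES_WITH_TABLES = 50
MAX_PAGES_SCANNED = 20


def check_limits(size_bytes, page_count, scanned=False, pages_with_tables=0):
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file is larger than 100MB")
    if scanned:
        if page_count > MAX_PAGES_SCANNED:
            problems.append("scanned PDF has more than 20 pages")
    else:
        if page_count > MAX_PAGES_NON_SCANNED:
            problems.append("non-scanned PDF has more than 200 pages")
        if pages_with_tables > MAX_PAGES_WITH_TABLES:
            problems.append("more than 50 pages contain tables")
    return problems


if __name__ == "__main__":
    # A 5MB, 30-page scanned document exceeds the scanned page limit.
    print(check_limits(5 * 1024 * 1024, 30, scanned=True))
```

The check cannot detect signed, encrypted, or policy-protected PDFs; those would still need to be identified separately.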

Java

Start by bookmarking or downloading the following key resources:

Authentication

After you submit the request access form, Adobe Document Cloud will email a zip file containing a pdftools-api-credentials.json file and private.key file.

Replace the pdftools-api-credentials.json file in the sample project with the one you receive from Adobe, and add the private.key file to the same path as your pdftools-api-credentials.json file.

Example pdftools-api-credentials.json file

{
  "client_credentials": {
     "client_id": " <YOUR_CLIENT_ID> ",
     "client_secret": " <YOUR_CLIENT_SECRET> "
  },
  "service_account_credentials": {
     "organization_id": " <YOUR_ORGANIZATION_ID> ",
     "account_id": " <YOUR_TECHNICAL_ACCOUNT_ID> ",
     "private_key_file": "private.key"
  }
}

Set up a Java environment

The quickest way to get up and running is to set up and configure the sample project. The project provides everything you need: ready-to-run sample code, a config file for your credential details, and pre-configured connections to dependencies such as the pdftools-extract-sdk:<version>-beta.jar file.

Note

Maven uses pom.xml to fetch the pdftools-extract-sdk from the public Maven repository when running the project. The .jar automatically downloads when you build the sample project. Alternatively, you can download the .jar file, and configure your own environment.

  1. Install Java 8 or above.

  2. Run javac -version to verify your install.

  3. Verify the JDK bin folder is included in the PATH variable (method varies by OS).

  4. Install Maven. You may use your preferred tool; for example:

  • Windows: Example: Chocolatey.

  • Macintosh: Example: brew install maven.

Sample project setup

  1. Download the sample project.

  2. Replace pdftools-api-credentials.json and private.key with the files from the ZIP archive sent by Adobe.

  3. Build the sample project with Maven: mvn clean install.

  4. Test the sample code on the command line. Refer to the How Tos for details about running samples. Additional details also reside in the API documentation.

Verifying download authenticity

For security reasons you may wish to confirm the installer’s authenticity. To do so:

  1. After installing the package, locate the PDF Extract .jar and its .jar.sha1 file. Note: if you’re using Maven, look in the .m2 directory.

  2. Calculate the SHA-1 hash of the .jar with any third-party utility.

  3. Open the .sha1 file.

  4. Verify the hash you generated matches the value in the .sha1 file.

bf57acfae5622e7418a03da5505e967eddb6812d
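
If no hashing utility is at hand, the comparison can be scripted. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and it assumes the .sha1 file starts with the hexadecimal digest (optionally followed by a file name).

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha1_file(jar_path, sha1_path):
    """Compare a file's SHA-1 digest with the first token stored in a .sha1 file."""
    # Assumption: the digest is the first whitespace-separated token in the file.
    expected = Path(sha1_path).read_text().split()[0].lower()
    return sha1_of_file(jar_path) == expected
```

For example, `matches_sha1_file("sdk.jar", "sdk.jar.sha1")` returns True when the two values agree; the file names here are placeholders for the paths in your local repository.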

Logging

Refer to the API docs for error and exception details.

  • For logging, use the slf4j API with a log4j2-slf4j binding.

  • Logging configurations are provided in src/main/resources/log4j2.properties.

  • Specify alternate bindings, if required, in pom.xml.

log4j2.properties file

   name=PropertiesConfig
   appenders = console

   # A sample console appender configuration which clients can change as needed.
   rootLogger.level = WARN
   rootLogger.appenderRefs = stdout
   rootLogger.appenderRef.stdout.ref = STDOUT

   appender.console.type = Console
   appender.console.name = STDOUT
   appender.console.layout.type = PatternLayout
   appender.console.layout.pattern = [%-5level] %d{yyyy-MM-dd HH:mm:ss.SSS} [%t] %c{1} - %msg%n

   loggers = pdftoolsextractsdk,validator,apache

   # Change the logging levels as desired. INFO is recommended.
   logger.pdftoolsextractsdk.name = com.adobe.platform.operation
   logger.pdftoolsextractsdk.level = INFO
   logger.pdftoolsextractsdk.additivity = false
   logger.pdftoolsextractsdk.appenderRef.console.ref = STDOUT

   logger.validator.name=org.hibernate
   logger.validator.level=WARN

   logger.apache.name=org.apache
   logger.apache.level=WARN

Test files

The sample files reference input files located in the sample project’s src/main/resources/ directory. You can of course modify the files and paths or use your own files.

Custom projects

While the samples use Maven, you can use your own tools and process.

To build a custom project:

  1. Access the .jar in the central Maven repository.

  2. Use your preferred dependency management tool (Ivy, Gradle, Maven), to include the SDK .jar dependency.

  3. Manually create pdftools-api-credentials.json and private.key.

  4. Add the authentication details as described above.


Node.js

Jumpstart your development by bookmarking or downloading the following key resources:

Authentication

After you submit the request access form, Adobe Document Cloud will email a zip file containing a pdftools-api-credentials.json file and private.key file.

Replace the pdftools-api-credentials.json file in the sample project with the one you receive from Adobe, and add the private.key file to the same path as your pdftools-api-credentials.json file.

Example pdftools-api-credentials.json file

{
  "client_credentials": {
     "client_id": " <YOUR_CLIENT_ID> ",
     "client_secret": " <YOUR_CLIENT_SECRET> "
  },
  "service_account_credentials": {
     "organization_id": " <YOUR_ORGANIZATION_ID> ",
     "account_id": " <YOUR_TECHNICAL_ACCOUNT_ID> ",
     "private_key_file": "private.key"
  }
}

Set up a Node.js environment

Running any sample or custom code requires the following steps:

  1. Install Node.js 10.13.0 or higher.

Note

The @adobe/pdftools-extract-node-sdk npm package automatically downloads when you build the sample project.

npm install --save @adobe/pdftools-extract-node-sdk

Sample Project setup

  1. Download the sample project.

  2. Replace pdftools-api-credentials.json and private.key with the files from the ZIP archive sent by Adobe.

  3. From the samples root directory, run npm install.

  4. Test the sample code on the command line.

  5. Refer to the How Tos for details about running samples. Additional details also reside in the API documentation.

Verifying download authenticity

For security reasons you may wish to confirm the installer’s authenticity. To do so:

  1. After installing the package, find and open package-lock.json.

  2. Find the “integrity” key.

  3. Verify the hash in the downloaded file matches the value published here.

sha512-TChqGK9w6ftvwcK5nDgun2KYoTKQOpFlJWDYDm/yqP/NLsGNiDgWONUBbn5/j6neMffyO9caBktUxqV68XRwzg==
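
Looking up the integrity value can also be scripted. The sketch below reads a parsed package-lock.json and returns the recorded integrity string for the SDK package; it handles both the older `dependencies` layout and the newer `packages` layout of the lock file. The helper names are hypothetical. The second function shows how such a string is formed (the prefix `sha512-` followed by the base64-encoded SHA-512 digest), in case you want to hash a downloaded tarball yourself.

```python
import base64
import hashlib
import json

PACKAGE = "@adobe/pdftools-extract-node-sdk"


def find_integrity(lock: dict, package: str = PACKAGE):
    """Return the integrity string recorded for `package`, or None if absent."""
    # Newer lock files keep entries under "packages", keyed by install path.
    entry = lock.get("packages", {}).get("node_modules/" + package)
    if entry is None:
        # Older lock files keep entries under "dependencies", keyed by name.
        entry = lock.get("dependencies", {}).get(package)
    return entry.get("integrity") if entry else None


def sri_sha512(data: bytes) -> str:
    """Build an integrity string: 'sha512-' plus the base64 SHA-512 digest."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    with open("package-lock.json", encoding="utf-8") as handle:
        print(find_integrity(json.load(handle)))
```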

Logging

Refer to the API docs for error and exception details.

The SDK uses the log4js API for logging. During execution, the SDK searches for config/pdftools-sdk-log4js-config.json in the working directory and reads the logging properties from there. If you do not provide a configuration file, the default logging logs INFO to the console. Customize the logging settings as needed.

pdftools-sdk-log4js-config.json file

 {
   "appenders": {
     "consoleAppender": {
       "_comment": "A sample console appender configuration, Clients can change as per their logging implementation",
       "type": "console",
       "layout": {
         "type": "pattern",
         "pattern": "%d:[%p]: %m"
       }
     }
   },
   "categories": {
     "default": {
       "appenders": [
         "consoleAppender"
       ],
       "_comment": "Change the logging levels as per need. info is recommended for pdftools-extract-node-sdk",
       "level": "info"
     }
   }
 }

Test files

Refer to each sample project’s resource directory for the requisite input/output files.

Custom projects

While building the sample project automatically downloads the Node package, you can do it manually if you wish to use your own tools and process.

  1. Go to https://www.npmjs.com/package/@adobe/pdftools-extract-node-sdk

  2. Download the latest package.


Python

Jumpstart your development by bookmarking or downloading the following key resources:

Authentication

After you submit the request access form, Adobe Document Cloud will email a zip file containing a pdftools-api-credentials.json file and private.key file.

Replace the pdftools-api-credentials.json file in the sample project with the one you receive from Adobe, and add the private.key file to the same path as your pdftools-api-credentials.json file.

Example pdftools-api-credentials.json file

{
  "client_credentials": {
     "client_id": " <YOUR_CLIENT_ID> ",
     "client_secret": " <YOUR_CLIENT_SECRET> "
  },
  "service_account_credentials": {
     "organization_id": " <YOUR_ORGANIZATION_ID> ",
     "account_id": " <YOUR_TECHNICAL_ACCOUNT_ID> ",
     "private_key_file": "private.key"
  }
}
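
Before running a sample, it can help to confirm that the credentials file has been filled in and that the private key sits beside it. The sketch below is a hypothetical helper, not part of the SDK; it checks only the keys shown in the example file above and treats any value beginning with `<YOUR_` as an unreplaced placeholder.

```python
import json
from pathlib import Path

# Keys taken from the example credentials file above.
REQUIRED = {
    "client_credentials": ("client_id", "client_secret"),
    "service_account_credentials": (
        "organization_id",
        "account_id",
        "private_key_file",
    ),
}


def find_credential_problems(credentials_path):
    """Return a list of problems found in the credentials file (empty if none)."""
    path = Path(credentials_path)
    config = json.loads(path.read_text())
    problems = []
    for section, keys in REQUIRED.items():
        values = config.get(section, {})
        for key in keys:
            value = str(values.get(key, "")).strip()
            if not value or value.startswith("<YOUR_"):
                problems.append(f"{section}.{key} is missing or still a placeholder")
    # The private key is expected in the same directory as the credentials file.
    key_file = config.get("service_account_credentials", {}).get("private_key_file")
    if key_file and not (path.parent / key_file).is_file():
        problems.append(f"{key_file} not found next to {path.name}")
    return problems


if __name__ == "__main__":
    for problem in find_credential_problems("pdftools-api-credentials.json"):
        print(problem)
```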

Set up a Python environment

Running any sample or custom code requires the following steps:

  1. Install Python 3.6 or higher.

Note

The pdfservices-extract-sdk package automatically downloads when you build the sample project.

pip install pdfservices-extract-sdk

Sample Project setup

  1. Download the sample project.

  2. Replace pdftools-api-credentials.json and private.key with the files from the ZIP archive sent by Adobe.

  3. From the samples root directory, run pip install -r requirements.txt.

  4. Test the sample code on the command line.

  5. Refer to the How Tos for details about running samples. Additional details also reside in the API documentation.

Verifying download authenticity

For security reasons you may wish to confirm the installer’s authenticity. To do so:

  1. After downloading the package archive, run the following command:

pip hash <download_dir>/pdfservices-extract-sdk-1.0.0b1.tar.gz

  2. The command returns the hash of the downloaded package.

  3. Verify the hash matches the value published here.

sha256-b55af6dcc3ea04f20de630a5b8f8fa386c3ea59ce18640dabb05d363a7f5df29

Logging

Refer to the API docs for error and exception details.

The SDK uses the Python standard logging module. Customize the logging settings as needed.

Default Logging Config

logging.getLogger(__name__).addHandler(logging.NullHandler())
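
With a NullHandler as the only default, the SDK would produce no log output until the application configures logging itself. A minimal sketch of doing so with the standard library follows; the function name, log format, and the `example.app` logger name are illustrative choices, not SDK requirements.

```python
import logging


def enable_console_logging(level=logging.INFO):
    """Configure the root logger so that library log records reach the console."""
    # basicConfig has no effect if the root logger already has handlers.
    logging.basicConfig(
        level=level,
        format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
    )


if __name__ == "__main__":
    enable_console_logging()
    logging.getLogger("example.app").info("logging configured")
```

Call this once at application start-up, before invoking the SDK, so that records from library loggers propagate to the configured console handler.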

Test files

Refer to each sample project’s resource directory for the requisite input/output files.

Custom projects

While building the sample project automatically downloads the Python package, you can do it manually if you wish to use your own tools and process.

  1. Go to https://pypi.org/project/pdfservices-extract-sdk/

  2. Download the latest package.


Known issues

  • Extraction of complex PDFs that takes more than 300 seconds results in a timeout error.