Processing image-type documents
BotCity Documents can also analyze and read image-type (scanned) PDF documents. To do so, use one of the available providers together with our OCR plugins.
Defining which OCR provider to use
BotCity supports two providers for extracting text from image-type PDFs: Google Cloud Vision and Amazon Textract. You choose the provider when you load the PDF for analysis and reading in BotCity Studio, as shown below:
Tip
We suggest that you keep the option "Generate code for imports and load file?" checked when loading the PDF document. BotCity Studio then adds a snippet to your source code with everything needed to import, instantiate, and configure the plugin for the entries that BotCity Studio generates later. Your code will then be ready to run and test, as shown below:
Google Cloud Vision
# Remember to install and add to your requirements.txt the following packages:
# botcity-documents
# botcity-cloudvision-plugin
# Import the packages
from botcity.document_processing import DocumentParser
from botcity.plugins.cloudvision import BotCloudVisionPlugin
# Instantiate the Google Cloud Vision plugin with the credential file
plugin = BotCloudVisionPlugin().credentials("<path_to_my>/credentials.json")
# Fetch the entries from the file after reading.
entries = plugin.read(r"<path_to_my>\statement.pdf").entries()
# Instantiate the parser
parser = DocumentParser()
# Load the parser with the entries
parser.load_entries(entries, sort=False)
Info
In addition to the BotCity Documents dependency, you must also add the plugin dependency for the OCR service you use to your pom.xml.
<dependency>
    <groupId>dev.botcity</groupId>
    <artifactId>bot-ocr-plugins</artifactId>
    <version>1.0.9</version>
</dependency>
/* Add the following into the imports section of your code.
import java.io.File;
import java.util.List;
// Import the OCR and Credential Models, DocumentParser and Entry objects
import dev.botcity.botcity_document_processing.parser.DocumentParser;
import dev.botcity.botcity_document_processing.parser.Entry;
import dev.botcity.plugin.ocr.model.GcpCredentialModel;
import dev.botcity.plugin.ocr.model.OcrDataModel;
*/
// Load the file
File file = new File("pdf_file_path");
// Set up the credential using the JSON file
GcpCredentialModel credential = GcpCredentialModel.fromFile("<pathToMy>/credentials.json");
// Set up the plugin using the File and Credential
OcrDataModel plugin = OcrDataModel.fromFileAndCredential(file, credential);
// Process the file and fetch the entries list
List<Entry> entries = plugin.process().getEntries();
// Create a DocumentParser object and set the entries list
DocumentParser parser = new DocumentParser();
parser.setEntries(entries);
String value = "";
Warnings when using the Google Cloud Vision service
To use the Google Cloud Vision provider, you need the BotCity plugin for Google Cloud Vision, as well as a Google service account credential with billing properly enabled.
For more information, see our documentation for the Google Cloud Vision plugin, as well as the content on credentials and billing.
Amazon Textract
# Remember to install and add to your requirements.txt the following packages:
# botcity-documents
# botcity-aws-textract-plugin
# Import the packages
from botcity.document_processing import DocumentParser
from botcity.plugins.aws.textract import BotAWSTextractPlugin
# Instantiate the plugin using the .aws file for credentials
plugin = BotAWSTextractPlugin()
# Fetch the entries from the file after reading.
entries = plugin.read(r"<path_to_my>\statement.pdf").entries
# Instantiate the parser
parser = DocumentParser()
# Load the parser with the entries
parser.load_entries(entries, sort=False)
Info
In addition to the BotCity Documents dependency, you must also add the plugin dependency for the OCR service you use to your pom.xml.
<dependency>
    <groupId>dev.botcity</groupId>
    <artifactId>bot-ocr-plugins</artifactId>
    <version>1.0.9</version>
</dependency>
/* Add the following into the imports section of your code.
import java.io.File;
import java.util.List;
// Import the OCR and Credential Models, DocumentParser and Entry objects
import dev.botcity.botcity_document_processing.parser.DocumentParser;
import dev.botcity.botcity_document_processing.parser.Entry;
import dev.botcity.plugin.ocr.model.AwsCredentialModel;
import dev.botcity.plugin.ocr.model.OcrDataModel;
*/
// Load the file
File file = new File("pdf_file_path");
// For use with the AWS CLI credential
AwsCredentialModel credential = AwsCredentialModel.builder().credentialAwsCli(true).build();
// Set up the plugin using the File and Credential
OcrDataModel plugin = OcrDataModel.fromFileAndCredential(file, credential);
// Process the file and fetch the entries list
List<Entry> entries = plugin.process().getEntries();
// Create a DocumentParser object and set the entries list
DocumentParser parser = new DocumentParser();
parser.setEntries(entries);
String value = "";
Warnings when using the Amazon Textract service
To use the Amazon Textract provider, you need the BotCity plugin for Amazon Textract, as well as an access key ID and a secret access key to authenticate against its API.
For more information, see our documentation for the Amazon Textract plugin, as well as the content on how to generate your access keys and how to configure them for use with the plugin.
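Both providers follow the same flow: instantiate the plugin, read the PDF, and hand the entries to the parser. As a minimal sketch, the provider choice could be isolated in one helper. The function name `read_entries` and the provider keys `"cloudvision"` and `"textract"` are hypothetical names introduced for illustration; the plugin calls themselves are the ones shown in the snippets above. The imports are deferred so that only the plugin you actually use needs to be installed.

```python
from typing import Optional


def read_entries(pdf_path: str, provider: str, credentials_path: Optional[str] = None):
    """Run OCR on pdf_path with the chosen provider and return its entries.

    `provider` is a hypothetical key: "cloudvision" or "textract".
    """
    if provider == "cloudvision":
        if credentials_path is None:
            raise ValueError("Google Cloud Vision requires a credentials JSON file")
        # Imported lazily so the Textract-only setup does not need this package.
        from botcity.plugins.cloudvision import BotCloudVisionPlugin

        plugin = BotCloudVisionPlugin().credentials(credentials_path)
        # As in the snippet above, the Cloud Vision plugin exposes entries() as a method.
        return plugin.read(pdf_path).entries()

    if provider == "textract":
        from botcity.plugins.aws.textract import BotAWSTextractPlugin

        # As in the snippet above, the Textract plugin exposes entries as an attribute.
        return BotAWSTextractPlugin().read(pdf_path).entries

    raise ValueError(f"Unknown OCR provider: {provider!r}")
```

The returned entries can then be passed to `parser.load_entries(entries, sort=False)` exactly as in the provider-specific snippets.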
Creating templates
Creating a reading template for a document is simple. With the code above in place, you can use BotCity Studio to generate the reading code automatically for each field you want to extract.
Click and drag the mouse to select the field you want to read (outlined in red). Then select the reading area related to that field (outlined in blue), as shown in the image below:
Repeat this process for each field you need to read, and your custom parser is built quickly. After selecting all the fields, you will see that the code was generated automatically.
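For each field, the generated code passes four numbers to `parser.read` along with the anchor entry. This page does not define them, but the values in the complete code further down are consistent with ratios of the anchor's own width and height: horizontal offset, vertical offset, width, and height of the reading area. The sketch below illustrates that interpretation only. The `Box` class, the `reading_area` function, and the anchor's pixel values are hypothetical and are not part of the BotCity API.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle; (x, y) is the top-left corner."""
    x: float
    y: float
    width: float
    height: float


def reading_area(anchor: Box, dx: float, dy: float, w: float, h: float) -> Box:
    """Scale the four ratios by the anchor's size (assumed interpretation)."""
    return Box(
        x=anchor.x + dx * anchor.width,
        y=anchor.y + dy * anchor.height,
        width=w * anchor.width,
        height=h * anchor.height,
    )


# Hypothetical anchor: a label 95 px wide and 12 px tall at position (40, 120).
anchor = Box(x=40, y=120, width=95, height=12)
area = reading_area(anchor, 1.042105, -0.583333, 1.547368, 1.833333)
print(round(area.x), round(area.y), round(area.width), round(area.height))
# prints: 139 113 147 22
```

Under this reading, the blue area starts just past the right edge of the red anchor and is slightly taller than it. This would also explain why a template keeps working when the document is scanned at a different scale.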
Complete code
# Remember to install and add to your requirements.txt the following packages:
# botcity-documents
# botcity-cloudvision-plugin
# Import the packages
from botcity.document_processing import DocumentParser
from botcity.plugins.cloudvision import BotCloudVisionPlugin
def parse_file(filename):
    # Instantiate the Google Cloud Vision plugin with the credential file
    plugin = BotCloudVisionPlugin().credentials("<path_to_my>/credentials.json")
    # Fetch the entries from the file after reading.
    entries = plugin.read(filename).entries()
    # Instantiate the parser
    parser = DocumentParser()
    # Load the parser with the entries
    parser.load_entries(entries, sort=False)
    # Account No
    _account_no_ = parser.get_first_entry("Account No :")
    value = parser.read(_account_no_, 1.042105, -0.583333, 1.547368, 1.833333)
    print(f'Account no: {value}')
    # Statement Date
    _statement_date_ = parser.get_first_entry("Statement Date :")
    value = parser.read(_statement_date_, 1.015873, -0.076923, 1.198413, 1.230769)
    print(f'Statement date: {value}')
    # Due Date
    _due_date_ = parser.get_first_entry("Due Date :")
    value = parser.read(_due_date_, 1.037037, -0.166667, 1.864198, 1.583333)
    print(f'Due Date: {value}')
    # Total Amount Due
    _total_amount_due_ = parser.get_first_entry("Total Amount Due :")
    value = parser.read(_total_amount_due_, -0.043956, 1.3, 1.076923, 2.1)
    print(f'Total amount due: {value}')
parse_file(filename="statement.pdf")
# Remember to install and add to your requirements.txt the following packages:
# botcity-documents
# botcity-aws-textract-plugin
# Import the packages
from botcity.document_processing import DocumentParser
from botcity.plugins.aws.textract import BotAWSTextractPlugin
def parse_file(filename):
    # Instantiate the plugin using the .aws file for credentials
    plugin = BotAWSTextractPlugin()
    # Fetch the entries from the file after reading.
    entries = plugin.read(filename).entries
    # Instantiate the parser
    parser = DocumentParser()
    # Load the parser with the entries
    parser.load_entries(entries, sort=False)
    # Account No
    _account_no_ = parser.get_first_entry("Account No :")
    value = parser.read(_account_no_, 1.042105, -0.583333, 1.547368, 1.833333)
    print(f'Account no: {value}')
    # Statement Date
    _statement_date_ = parser.get_first_entry("Statement Date :")
    value = parser.read(_statement_date_, 1.015873, -0.076923, 1.198413, 1.230769)
    print(f'Statement date: {value}')
    # Due Date
    _due_date_ = parser.get_first_entry("Due Date :")
    value = parser.read(_due_date_, 1.037037, -0.166667, 1.864198, 1.583333)
    print(f'Due Date: {value}')
    # Total Amount Due
    _total_amount_due_ = parser.get_first_entry("Total Amount Due :")
    value = parser.read(_total_amount_due_, -0.043956, 1.3, 1.076923, 2.1)
    print(f'Total amount due: {value}')
parse_file(filename="statement.pdf")
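OCR output can vary between scans, so a label such as "Due Date :" may not always be recognized exactly. The sketch below wraps the two-step lookup from the templates above in a helper that returns `None` instead of failing when the anchor is missing. It assumes that `get_first_entry` returns `None` when no entry matches, which this page does not confirm. `read_field` and `_StubParser` are hypothetical names, and the stub only stands in for `DocumentParser` so the example runs without an OCR call.

```python
from typing import Optional


def read_field(parser, label: str, x: float, y: float, w: float, h: float) -> Optional[str]:
    """Read the area next to `label`, or return None if the label is not found."""
    anchor = parser.get_first_entry(label)
    if anchor is None:  # assumption: a missing label yields None
        return None
    return parser.read(anchor, x, y, w, h)


class _StubParser:
    """Stand-in for DocumentParser; it knows a single label."""

    def get_first_entry(self, label):
        return "anchor" if label == "Due Date :" else None

    def read(self, entry, x, y, w, h):
        return "03/29/2016"


parser = _StubParser()
found = read_field(parser, "Due Date :", 1.037037, -0.166667, 1.864198, 1.583333)
missing = read_field(parser, "Invoice No :", 0.0, 0.0, 1.0, 1.0)
print(found)    # prints: 03/29/2016
print(missing)  # prints: None
```

With a real `DocumentParser`, the same helper can replace each `get_first_entry` and `read` pair in `parse_file`.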
The result of the analysis and reading of the document
Running either of the templates above prints the extracted values, giving the following result:
Account no: 1023456789-0
Statement date: 03/08/2016
Due Date: 03/29/2016
Total amount due: $ 115.28
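The values come back as plain strings, so a bot will usually convert them before using them. The following is a minimal sketch using only the Python standard library. It assumes the month/day/year date format and the dollar amount format seen in the output above, and the helper names `parse_statement_date` and `parse_amount` are illustrative only.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_statement_date(text: str) -> date:
    """Convert a date such as '03/29/2016' (month/day/year) to a date object."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def parse_amount(text: str) -> Decimal:
    """Convert an amount such as '$ 115.28' to a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


due = parse_statement_date("03/29/2016")
total = parse_amount("$ 115.28")
print(due.isoformat(), total)
# prints: 2016-03-29 115.28
```

`Decimal` is used instead of `float` so that currency values keep their exact cents.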