The contents of rekog.py:
import boto3
# CHANGE THESE
s3BucketName = "picostat.com"
documentName = "bg.pdf"
# Amazon Textract client
textract = boto3.client('textract')
# Call Amazon Textract
response = textract.detect_document_text(
Document={
'S3Object': {
'Bucket': s3BucketName,
'Name': documentName
}
})
# print(response)
# Print detected text
for item in response["Blocks"]:
if item["BlockType"] == "LINE":
print (item["Text"])
All you have to do to adapt it for your case is upload your image to an S3 bucket. In my case that was “picostat.com” which is listed in the code above. You then change the name of the image to whatever your filename is then run it with this command:
python3 rekog.py > bg.txt
The greater than tells Python to put the transcribed text in the file called bg.txt
.
If you get an error message that boto
is not installed you can install it on a Mac with this command:
pip3 install boto3 --user
To convert a large PDF (larged than 5 MB) you will need an asynchronous script:
Contents of rekog2.py
import boto3
import time
def startJob(s3BucketName, objectName):
response = None
client = boto3.client('textract')
response = client.start_document_text_detection(
DocumentLocation={
'S3Object': {
'Bucket': s3BucketName,
'Name': objectName
}
})
return response["JobId"]
def isJobComplete(jobId):
# For production use cases, use SNS based notification
# Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
time.sleep(30)
client = boto3.client('textract')
response = client.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
while(status == "IN_PROGRESS"):
time.sleep(5)
response = client.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
return status
def getJobResults(jobId):
pages = []
client = boto3.client('textract')
response = client.get_document_text_detection(JobId=jobId)
pages.append(response)
print("Resultset page recieved: {}".format(len(pages)))
nextToken = None
if('NextToken' in response):
nextToken = response['NextToken']
while(nextToken):
response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
pages.append(response)
print("Resultset page recieved: {}".format(len(pages)))
nextToken = None
if('NextToken' in response):
nextToken = response['NextToken']
return pages
# CHANGE THESE
s3BucketName = "picostat.com"
documentName = "bg.pdf"
jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(isJobComplete(jobId)):
response = getJobResults(jobId)
#print(response)
# Print detected text
for resultPage in response:
for item in resultPage["Blocks"]:
if item["BlockType"] == "LINE":
print (item["Text"])
Once again, all you need to do is change the document name and the bucket name and then run this command.
python3 rekog2.py > bg.txt
If you’re not sure how to upload a document to S3 you can do it with this command if you have AWS CLI installed:
aws s3 cp bg.pdf s3://picostat.com
Or you can go to the AWS Console to use the graphical user interface uploaded in the S3 section.