Published: 2020-07-29 | Originally published at AWS Blog
Whether your organization is a multinational enterprise present in many countries, or a small startup hungry for global success, translating your content into local languages may be an enduring challenge. Indeed, text data often comes in many formats, and processing each of them may require a different tool. Also, as these tools may not all support the same language pairs, you may have to convert certain documents to intermediate formats, or even resort to manual translation. All these issues add extra cost, and create unnecessary complexity in building consistent and automated translation workflows.
Amazon Translate aims to solve these problems in a simple and cost-effective fashion. Using either the AWS console or a single API call, Amazon Translate makes it easy for AWS customers to quickly and accurately translate text in 55 different languages and variants.
Earlier this year, Amazon Translate introduced batch translation for plain text and HTML documents. Today, I’m very happy to announce that batch translation now also supports Office documents, namely .docx, .xlsx and .pptx files as defined by the Office Open XML standard.
Introducing Amazon Translate for Office Documents
The process is extremely simple. As you would expect, source documents have to be stored in an Amazon Simple Storage Service (Amazon S3) bucket. Please note that no document may be larger than 20 megabytes, or have more than 1 million characters.
Each batch translation job processes a single file type and a single source language. Thus, we recommend that you organize your documents in a logical fashion in S3, storing each file type and each language under its own prefix.
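One way to picture such a layout is a small helper that derives an S3 key from the file extension and the source language. This is only a sketch: the `input/<type>/<language>/` scheme, the `input_key` function and the file names are assumptions made for illustration, not a convention required by Amazon Translate.

```python
from pathlib import PurePosixPath

# Office file types covered in this post
OFFICE_EXTENSIONS = {"docx", "xlsx", "pptx"}


def input_key(filename, language_code, root="input"):
    """Return a hypothetical S3 key of the form root/<type>/<language>/<filename>."""
    path = PurePosixPath(filename)
    extension = path.suffix.lstrip(".").lower()
    if extension not in OFFICE_EXTENSIONS:
        raise ValueError("unsupported file type: %r" % filename)
    return "%s/%s/%s/%s" % (root, extension, language_code, path.name)


# Hypothetical file names, used only to show the resulting keys
print(input_key("blog_post.docx", "en"))
print(input_key("budget.xlsx", "fr"))
```

With this scheme, a job translating English .docx files would simply point at the `input/docx/en/` prefix.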
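Then, using either the AWS console or the StartTextTranslationJob API in one of the AWS language SDKs, you can launch a translation job, passing the input and output locations in S3, the file type, the source and target languages, and an IAM role that lets Translate access your bucket.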
Once the job is complete, you can collect translated files at the output location.
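For reference, here is roughly what launching a job could look like with the AWS SDK for Python (boto3). Treat it as a sketch under assumptions: the bucket, prefixes, job name and role ARN are placeholders, and the helper functions are mine, not part of the SDK. The request is built separately from the call, so you can inspect it without AWS credentials.

```python
# MIME types of the Office Open XML formats, used as the job's ContentType
CONTENT_TYPES = {
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def build_job_request(job_name, extension, input_uri, output_uri, role_arn,
                      source_language, target_language):
    """Assemble the keyword arguments for StartTextTranslationJob."""
    return {
        "JobName": job_name,
        "InputDataConfig": {"S3Uri": input_uri,
                            "ContentType": CONTENT_TYPES[extension]},
        "OutputDataConfig": {"S3Uri": output_uri},
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_language,
        "TargetLanguageCodes": [target_language],
    }


def start_job(request):
    """Submit the job. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3
    translate = boto3.client("translate")
    response = translate.start_text_translation_job(**request)
    return response["JobId"]


# All values below are placeholders for illustration
request = build_job_request(
    job_name="docx-en-to-fr",
    extension="docx",
    input_uri="s3://my-bucket/input/docx/en/",
    output_uri="s3://my-bucket/output/",
    role_arn="arn:aws:iam::123456789012:role/MyTranslateRole",
    source_language="en",
    target_language="fr",
)
print(request["InputDataConfig"]["ContentType"])
```

Calling `start_job(request)` would then submit the job and return its identifier.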
Let’s do a quick demo!
Translating Office Documents
Using the Amazon S3 console, I first upload a few .docx documents to one of my buckets.
Then, moving to the Translate console, I create a new batch translation job, giving it a name, and selecting both the source and target languages.
Then, I define the location of my documents in Amazon S3, and their format, .docx in this case. Optionally, I could apply a custom terminology, to make sure specific words are translated exactly the way that I want.
Likewise, I define the output location for translated files. Please make sure that this path exists, as Translate will not create it for you.
Finally, I set the AWS Identity and Access Management (IAM) role, giving my Translate job the appropriate permissions to access Amazon S3. Here, I use a role that I created previously, but you can also let Translate create one for you. Then, I click on ‘Create job’ to launch the batch job.
The job starts immediately.
A little while later, the job is complete. All three documents have been translated successfully.
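If you launch jobs from code rather than from the console, you can check on them with the DescribeTextTranslationJob API. Here is a minimal polling sketch: `wait_for_job`, the delay and the attempt limit are my own choices for illustration, and the client is passed in as an argument so that the function is easy to try with a stub.

```python
import time

# Statuses after which a batch job will not change any further
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(translate, job_id, delay_seconds=60, max_attempts=120):
    """Poll DescribeTextTranslationJob until the job reaches a terminal status."""
    for _ in range(max_attempts):
        response = translate.describe_text_translation_job(JobId=job_id)
        status = response["TextTranslationJobProperties"]["JobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(delay_seconds)
    raise TimeoutError("job %s did not finish in time" % job_id)
```

You would call it as `wait_for_job(boto3.client('translate'), job_id)`, where `job_id` is the identifier returned when the job was started.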
Translated files are available at the output location, as visible in the S3 console.
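The same check can be done programmatically by listing the objects under the output prefix with the S3 ListObjectsV2 API. The sketch below assumes a hypothetical bucket and prefix of your choosing; `list_translated_files` is an illustrative helper, and the S3 client is passed in as an argument.

```python
def list_translated_files(s3, bucket, prefix):
    """Return the keys of all objects stored under the given output prefix."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3.list_objects_v2(**kwargs)
        keys.extend(item["Key"] for item in page.get("Contents", []))
        # Results are paginated: keep going while S3 reports more pages
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

For instance, `list_translated_files(boto3.client('s3'), 'my-bucket', 'output/')` would return every key under `output/` in a bucket named `my-bucket`.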
Downloading one of the translated files, I can open it and compare it to the original version.
For small-scale use, it’s extremely easy to use the AWS console to translate Office files. Of course, you can also use the Translate API to build automated workflows.
Automating Batch Translation
In a previous post, we showed you how to automate batch translation with an AWS Lambda function. You could expand on this example, and add language detection with Amazon Comprehend. For instance, here’s how you could combine the DetectDominantLanguage API with the python-docx open source library to detect the language of .docx files.
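import boto3
from docx import Document

# Use the first paragraph of the document as a language sample
document = Document('blog_post.docx')
text = document.paragraphs[0].text

# Ask Amazon Comprehend which language dominates the sample
comprehend = boto3.client('comprehend')
response = comprehend.detect_dominant_language(Text=text)

# Take the first language returned, along with its confidence score
top_language = response['Languages'][0]
code = top_language['LanguageCode']
score = top_language['Score']
print("%s, %f" % (code, score))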
Pretty simple! You could also detect the type of each file based on its extension, and move it to the proper input location in S3. Then, you could schedule a Lambda function with CloudWatch Events to periodically translate files, and send a notification by email. Of course, you could use AWS Step Functions to build more elaborate workflows. Your imagination is the limit!
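As an illustration of the file-sorting step, here is a sketch that moves Office files from an upload prefix to a per-type, per-language input prefix. The `incoming/` and `input/<type>/<language>/` prefixes and the `sort_uploads` function are assumptions of mine, and the language code is expected to come from a detection step such as the one above. For brevity, it only looks at the first page of listing results.

```python
OFFICE_EXTENSIONS = ("docx", "xlsx", "pptx")


def sort_uploads(s3, bucket, language_code, incoming="incoming/", root="input"):
    """Move Office files from incoming/ to root/<type>/<language>/; return the new keys."""
    moved = []
    # Only the first page of results is processed in this sketch
    page = s3.list_objects_v2(Bucket=bucket, Prefix=incoming)
    for item in page.get("Contents", []):
        key = item["Key"]
        name = key.rsplit("/", 1)[-1]
        extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if extension not in OFFICE_EXTENSIONS:
            continue  # leave unsupported files where they are
        new_key = "%s/%s/%s/%s" % (root, extension, language_code, name)
        # S3 has no rename operation: copy to the new key, then delete the original
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
        moved.append(new_key)
    return moved
```

A scheduled Lambda function could call `sort_uploads(boto3.client('s3'), bucket, code)` before starting one translation job per input prefix.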
Getting Started
You can start translating Office documents today in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (London), Europe (Frankfurt), and Asia Pacific (Seoul).
If you’ve never tried Amazon Translate, did you know that the free tier offers 2 million characters per month for the first 12 months, starting from your first translation request?
Give it a try, and let us know what you think. We’re looking forward to your feedback: please post it to the AWS Forum for Amazon Translate, or send it to your usual AWS support contacts.
- Julien
Julien is the Artificial Intelligence & Machine Learning Evangelist for EMEA. He focuses on helping developers and enterprises bring their ideas to life. In his spare time, he reads the works of JRR Tolkien again and again.