Using NLTK data with Serverless Framework in AWS Lambda with Layers
Any technology that lets developers concentrate on what they are good at (coding), without being bogged down by chores like deployments and babysitting servers, empowers them.
Serverless is such a technology. I won’t go into detail here, as there is an abundance of material on the web about the advantages of going serverless. Almost every public cloud provider has a serverless offering now, with AWS being the pioneer.
The Serverless Framework is the icing on the cake: it makes developing and deploying serverless applications seamless.
NLTK (Natural Language Toolkit) is one of the oldest Python libraries for Natural Language Processing (NLP), available since 2001. In this blog post I will show how you can deploy your NLTK-based solution on AWS Lambda with layers using the Serverless Framework.
For NLTK to work, it needs a large amount of data (corpora and trained models) for the languages it supports. You can read more about NLTK data over here. This data must be preinstalled on the machine where your NLTK-based solution runs. With serverless there are still servers involved, but they are not exposed to us; the cloud provider manages them internally. Hence you cannot SSH or remote into the machine and install the data. Wouldn’t it be a sad serverless story if I ended here?
AWS Lambda layers are here to help us. They allow us to package additional data alongside the Lambda deployment package. Layers can be shared across multiple Lambda functions or accounts. They were introduced at the AWS re:Invent conference in 2018.
Here is the serverless.yml file
First things first. The NLTK library reads the environment variable NLTK_DATA to locate its corpora data. This variable is set on line 48. AWS extracts the contents of a layer under the /opt directory. I’ve created a subdirectory named nltk in my project directory, and under it I have created nltk_data, where the actual data lies. See the project structure for a better understanding.
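To make the lookup concrete, here is a minimal sketch of a sanity check you could run at the start of a handler. It assumes the layer layout described above, so the data would end up at /opt/nltk_data and NLTK_DATA would point there. The helper names nltk_data_dirs and missing_resources are my own for illustration and are not part of NLTK; the resource paths follow the usual nltk_data folder layout (corpora/stopwords, tokenizers/punkt).

```python
import os
from pathlib import Path


def nltk_data_dirs(environ=None):
    """Return the directories listed in NLTK_DATA, in order.

    Like NLTK, this splits the variable on the OS path separator,
    so several directories can be listed. Hypothetical helper.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    return [Path(p) for p in raw.split(os.pathsep) if p]


def missing_resources(data_dir, resources):
    """Return the resources (relative paths) not found under data_dir.

    A resource counts as present if it exists as a folder or as a
    .zip archive of the same name. Hypothetical helper.
    """
    missing = []
    for resource in resources:
        base = Path(data_dir) / resource
        zipped = base.with_name(base.name + ".zip")
        if not base.exists() and not zipped.exists():
            missing.append(resource)
    return missing


if __name__ == "__main__":
    # Assumed layout: NLTK_DATA=/opt/nltk_data inside Lambda.
    for data_dir in nltk_data_dirs():
        gaps = missing_resources(data_dir, ["corpora/stopwords", "tokenizers/punkt"])
        print(data_dir, "missing:", gaps)
```

Running this locally with NLTK_DATA pointed at nltk/nltk_data is a quick way to confirm the folder you are about to zip contains what your code expects.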
Next comes the important part: the configuration of layers, which is done on lines 109 to 118. I’ve commented each property of the layers object for your understanding. The most important, and required, property is path, whose value should be the path of the folder that will be zipped as the Lambda layer. In my case the folder name is nltk. Take note that the unzipped size of a function together with all of its layers must not exceed 250 MB. The full NLTK data collection is more than 2 GB in size, so you should be selective about which modules you use in your solution. In my case I only required stopwords and tokenize, which come to around 32 MB. The configured layer is referenced on line 69. Notice the convention used in the reference: the layer name in PascalCase followed by LambdaLayer.
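Since the 250 MB unzipped limit is easy to hit with NLTK data, it can help to measure the layer folder before deploying. The following is a rough sketch using only the standard library; the function names and the other_bytes parameter (for the size of your function code and any other layers) are my own assumptions, and the folder name nltk is the one from my project.

```python
import os

# 250 MB unzipped limit, expressed in bytes.
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_lambda(root, other_bytes=0, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return True if the folder plus other_bytes stays within the limit."""
    return directory_size_bytes(root) + other_bytes <= limit


if __name__ == "__main__":
    layer_dir = "nltk"  # the folder that will be zipped as the layer
    size_mb = directory_size_bytes(layer_dir) / (1024 * 1024)
    print(f"{layer_dir}: {size_mb:.1f} MB, fits: {fits_in_lambda(layer_dir)}")
```

If the check fails, trimming the data down to only the corpora and tokenizers you actually use is usually the simplest fix.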
I’m also using serverless-python-requirements to manage Python package dependencies. The full source code for this blog post is available in my GitHub repo.
Hope this helps.
Originally published at http://goldytech.wordpress.com on June 4, 2019.