Creating valuable insights from raw data files, such as audio or video, has traditionally been a manual and tedious process, and the results have been mixed because so much depends on human judgment.
Thanks to advances in machine learning systems, coupled with the rapidly deployable nature of serverless technology as a middleware layer, we can now build sophisticated data insight platforms that remove much of the time this work has typically required.
With this in mind, we’ll look at:
- How to build end-to-end data insight and predictor systems on top of serverless and machine learning systems.
- Best practices for working with serverless technology for ferrying information between raw data files and machine learning systems through an eventing system.
- Considerations and practical examples of working with the security implications of dealing with sensitive information.
1. Better Data with Machine Learning and Serverless
Jonathan LeBlanc (Director of Developer Advocacy @ Box)
Twitter: @jcleblanc
Email: jleblanc@box.com
2. Agenda for Today
Building Blocks: How are these systems built?
Best Practices: How do we architect the solution?
Security Considerations: How do we ensure data security?
Jonathan LeBlanc • Director of Developer Advocacy @ Box • Twitter: @jcleblanc • Email: jleblanc@box.com
4. 1 What Machine Learning Isn’t
5. 1 Components of the System
Serverless Framework: Provides the compute and data management from the stored
data location to the machine learning engine.
Machine Learning System: Provides the data enhancement capabilities that improve
the underlying source data’s metadata (information about information).
6. 1 Why Serverless?
On Demand: Calls to the machine learning system are only required when files need
processing, which may be infrequent.
No hosting: You don’t have to run or manage any servers, containers, or VMs of
your own.
Pricing based on use: Execution resources are only run (and charged for) based
on your use, typically resulting in very low server costs.
Different stack options: Multiple serverless systems exist to fit stack needs,
including numerous open source options.
7. 1 Components of the System
Webhook / Event Pump System: Handles notifications to the middleware layer
when a new file should be processed.
Middleware Layer: Handles communication between the data source and
machine learning systems.
Metadata Layer: The storage facility for machine learning data responses.
Token Downscoping System: Allows you to pass tightly scoped read / write
tokens through multiple uncontrolled system layers.
8. 1 How a Data / ML System Works
Cloud Data: Data store & initial metadata
Serverless Framework: Callback handler and code execution
Machine Learning: Data processor and enhancer
Diagram flow labels: Webhook, Metadata, Execute, Callback
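The flow in this diagram can be sketched as two small middleware functions: one that reacts to the webhook and hands the file to the machine learning system, and one that receives the machine learning callback and writes the results back as metadata. This is a minimal sketch under assumptions: `createMiddleware`, `mlClient.submit`, and `dataStore.writeMetadata` are hypothetical names, the two clients are in-memory fakes rather than real SDKs, and the order of the steps is inferred from the diagram labels.

```javascript
// Hypothetical middleware built from two single-purpose functions.
// `mlClient` and `dataStore` are injected stand-ins, not real SDKs.
const createMiddleware = ({ mlClient, dataStore }) => ({
  // Webhook: the data store reports a new file, so hand it to the ML system.
  onWebhook: (webhookEvent) => {
    const { fileId } = webhookEvent;
    return mlClient.submit(fileId);
  },
  // Callback: the ML system reports its results, so store them as metadata.
  onMlCallback: (mlEvent) => {
    const { fileId, insights } = mlEvent;
    dataStore.writeMetadata(fileId, insights);
    return { fileId, stored: true };
  },
});

// Usage with in-memory fakes standing in for the real systems.
const metadata = {};
const submitted = [];
const middleware = createMiddleware({
  mlClient: {
    submit: (fileId) => {
      submitted.push(fileId);
      return { fileId, queued: true };
    },
  },
  dataStore: {
    writeMetadata: (fileId, insights) => {
      metadata[fileId] = insights;
    },
  },
});

middleware.onWebhook({ fileId: 'file-1' });
middleware.onMlCallback({ fileId: 'file-1', insights: { keywords: ['budget'] } });
```

Injecting the two clients keeps the middleware independent of any one data store or machine learning vendor.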
9. 1 Common Serverless Frameworks
AWS Lambda:
https://aws.amazon.com/lambda/
Azure Functions:
https://azure.microsoft.com/en-us/services/functions/
Google Cloud Functions:
https://cloud.google.com/functions/
IronFunctions:
https://github.com/iron-io/functions
OpenWhisk:
https://openwhisk.apache.org/
Fission:
https://fission.io/
Considerations
1. Your stack
2. Pricing / free use
3. Supported languages
4. Regional support
10. 1 Machine Learning Frameworks
Audio / Video / Image
• [video] MS Video Indexer
• [audio] Voicebase
• [face] Hive AI
• [image] Clarifai
• [image] Google Vision
• [mixed] IBM Watson
• [moderation] MS Content Moderator
• [face] Kairos
• [audio] AT&T Speech
• [image] Amazon Rekognition
Text Extraction
• [id] Acuant
• [invoice] Rossum.AI
• [contract] eBrevia
• [lease] Leverton
• [resume] Textkernel
• [prediction] AmazonML
• [analysis] Aylien
• [classification] MonkeyLearn
• [natural language] ApiAI
• [sentiment] AlchemyText
Open Source
• TensorFlow
• Keras
• Scikit-learn
• MS Cognitive Toolkit
• Theano
• Caffe
• Torch
• Accord.NET
12. 2 Program Logic and Serverless Separation
Serverless function agnostic: The core logic of the function should be separate
from the serverless requirements. Thin handlers / routers may be written on
top of the core logic to maintain separation.
Service deployments: To allow for deployment amongst numerous serverless
technologies, systems like serverless.com may be utilized.
Testability: The separation of concerns allows you to test the function
separately from the container.
Handler: Separate handler from core program logic for testability.
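13. // API Gateway Handler
exports.handler = (event, context, callback) => {
  // Check for a valid event before handing off to the core logic
  if (isValidEvent(event)) {
    // processEvent holds the core program logic (defined elsewhere) and is
    // assumed to invoke callback when it finishes
    processEvent(event, callback);
  } else {
    callback(null, { statusCode: 200, body: 'Event received but invalid' });
  }
};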
AWS Lambda Handler
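To illustrate the separation described on the previous slide, here is a minimal sketch in which one piece of core logic sits behind two thin adapters. Everything in it is an assumption for illustration: `describeFile` is a hypothetical core function, `lambdaHandler` uses the same `(event, context, callback)` shape as the handler above, and `httpHandler` uses an Express-style `(req, res)` shape of the kind some other platforms expose.

```javascript
// Core logic (hypothetical): plain data in, plain data out, no platform types.
const describeFile = (payload) => ({
  fileId: payload.fileId,
  status: payload.fileId ? 'queued' : 'ignored',
});

// Thin adapter for a Lambda-style (event, context, callback) signature.
const lambdaHandler = (event, context, callback) => {
  const result = describeFile(JSON.parse(event.body));
  callback(null, { statusCode: 200, body: JSON.stringify(result) });
};

// Thin adapter for an Express-style (req, res) signature.
const httpHandler = (req, res) => {
  res.status(200).json(describeFile(req.body));
};

// The core logic can be exercised directly, with no container involved.
console.log(describeFile({ fileId: 'file-1' }));
```

Because the adapters only translate inputs and outputs, unit tests can target `describeFile` alone, and moving to another serverless platform means writing a new adapter rather than rewriting the logic.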
14. 2 Dealing with Cold Starts
What is it: The latency experienced when a function is triggered and there isn’t a
warm / idle container available to run it. A container is automatically
dropped after a period of inactivity.
Options: You can either keep the container warm through increased memory
allocation and periodic calls, or deal with the cold start.
Fewer libraries: The more libraries that are used the longer it will take to start
the container.
Smaller functions: Writing smaller functions decreases start time.
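One common way to soften the cost of heavy dependencies is to create them lazily and reuse them while the container stays warm, since module-level state persists between invocations of a warm container. The sketch below is illustrative only: `createClient` is a hypothetical stand-in for expensive setup such as loading a large SDK, and `initCount` exists purely to show that setup runs once.

```javascript
// Module scope survives between invocations while the container stays warm.
let cachedClient = null;
let initCount = 0;

// Hypothetical expensive setup (for example, loading a large SDK).
const createClient = () => {
  initCount += 1;
  return { analyze: (fileId) => `analyzed:${fileId}` };
};

// Create the client on first use only, then reuse the cached instance.
const getClient = () => {
  if (!cachedClient) {
    cachedClient = createClient();
  }
  return cachedClient;
};

const handler = (event, context, callback) => {
  const body = getClient().analyze(event.fileId);
  callback(null, { statusCode: 200, body });
};
```

Only the first invocation after a cold start pays for `createClient`; later invocations in the same container skip it.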
15. 2 Exit Callback Hygiene
Error logging: With many serverless environments, proper callback use will
provide full data logging.
Reliability: Failing to exit properly can result in your function executing until a
timeout is hit. Timeouts may also cause subsequent invocations to require a
cold start, which results in additional latency.
Cost: If a timeout occurs, you will be charged for the entire timeout time.
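As a rough illustration of these points, the sketch below makes sure every code path ends by invoking the callback exactly once, so the function never idles until a timeout. It is a simplified, synchronous example: `processEvent` is a hypothetical core function, and real handlers doing asynchronous work would need the same discipline on every promise or callback branch.

```javascript
// Hypothetical core logic: throws on bad input, returns a result otherwise.
const processEvent = (event) => {
  if (!event || !event.body) {
    throw new Error('Missing event body');
  }
  return { received: event.body.length };
};

// Thin handler: every path exits through the callback exactly once.
const handler = (event, context, callback) => {
  let result;
  try {
    result = processEvent(event);
  } catch (err) {
    // Passing the error to the callback lets the platform log it and exit.
    return callback(err);
  }
  return callback(null, { statusCode: 200, body: JSON.stringify(result) });
};
```

Returning immediately after each `callback` call guards against accidentally falling through and calling it a second time.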
17. 2 Writing Stateless Single Purpose Functions
Error isolation: Debugging and error handling are
easier with function / concern isolation.
Scaling: With monolith functions, you have to
optimize the entire function for all of its elements,
rather than the specific functionality receiving the
most calls / traffic.
Planning and testing: It’s easier to plan and write
test plans for functions with singular concerns.
18. /**
 * Check for a valid event.
 * @param {object} indexerEvent - indexer event
 * @return {boolean} - true if valid event
 */
const isValidEvent = (indexerEvent) => {
  // Coerce to a boolean so the return value matches the documented type
  return Boolean(indexerEvent.body || indexerEvent.queryStringParameters);
};
Valid Event Function
20. 3 Security Considerations
Serverless use consideration: Are serverless systems a viable / approved
mechanism within your organization?
Token exposure: Many API auth systems are token based, with broadly scoped
tokens, leading to the potential of token leakage.
Credential exposure: With the use of numerous APIs, each with auth
credentials, we have the potential of credential leakage.
Sensitive information exposure: Data is being passed through multiple systems
and we have to be aware of how the information is used / stored.
21. 3 Middleware System
Serverless Solution
All compute functionality is offloaded to the serverless
framework.
On-prem Solution
All compute functionality (and connection to the ML
system) is run off of existing internal servers.
22. 3 Protecting Credentials
Use Secure Storage: Use a secure system to store API credentials or tokens,
such as the AWS Systems Manager Parameter Store.
Least Privilege Principle: Functions requiring access to credentials should follow
the least privilege principle, meaning they have access to only as much data as
they absolutely need.
Separate Environment Credentials: Credentials used in a more open developer
environment should not be the same used in a production deployment.
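The points above can be sketched as a small lookup helper that reads credentials from a secure store and keeps each environment's credentials under a separate path. This is only a sketch under assumptions: the `/ml-pipeline/<environment>/<name>` naming scheme, `credentialPath`, and `loadCredential` are hypothetical, and `fetchSecret` is an injected stand-in for whatever secure store lookup is in use (on AWS it might wrap a Parameter Store read).

```javascript
// Hypothetical naming convention: /ml-pipeline/<environment>/<credential name>.
const ALLOWED_ENVIRONMENTS = ['development', 'production'];

// Build the store path, refusing environments that are not explicitly known.
const credentialPath = (environment, name) => {
  if (!ALLOWED_ENVIRONMENTS.includes(environment)) {
    throw new Error(`Unknown environment: ${environment}`);
  }
  return `/ml-pipeline/${environment}/${name}`;
};

// `fetchSecret` is an injected stand-in for a secure store lookup.
const loadCredential = (fetchSecret, environment, name) =>
  fetchSecret(credentialPath(environment, name));

// Usage with an in-memory fake store (a real store would hold the secrets).
const fakeStore = {
  '/ml-pipeline/development/ml-api-key': 'dev-key',
  '/ml-pipeline/production/ml-api-key': 'prod-key',
};
const fetchFromFake = (path) => fakeStore[path];
const devKey = loadCredential(fetchFromFake, 'development', 'ml-api-key');
```

Keeping the environment in the path also makes least privilege easier to apply, since a function's access policy can be limited to the one prefix it needs.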
23. 3 Token Downscoping
Access Token: Fully scoped access token
Downscoped Token: Tightly scoped child token
Channel Transmission: Transmit through uncontrolled channels
24. 3 Token Downscoping Components
Tightly scoped for single file: A token should only be scoped for the item
needed for processing, such as a file.
Short lived: Downscoped tokens should only live for their natural useful time
(e.g. 1 hour)
Revocable: Downscoped tokens may be revoked before natural expiration
through the API.
Split read / write functions: To further scope token exposure, separate read /
write tokens can be issued.
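As one possible illustration, the sketch below builds the form body for a downscoping request in the style of OAuth 2.0 Token Exchange (RFC 8693), asking for a child token limited to a single file and a single scope. The parameter names and URNs come from that specification; the `https://api.example.com` resource URL, the `file_read` / `file_write` scope names, and `buildDownscopeRequest` itself are assumptions, so the exact values should be taken from the API provider's documentation.

```javascript
// Grant type and token type identifiers defined by RFC 8693.
const TOKEN_EXCHANGE_GRANT = 'urn:ietf:params:oauth:grant-type:token-exchange';
const ACCESS_TOKEN_TYPE = 'urn:ietf:params:oauth:token-type:access_token';

// Build the form body for exchanging a parent token for a child token that
// is limited to one file (resource) and one permission (scope).
// The resource URL and scope names are hypothetical placeholders.
const buildDownscopeRequest = (parentToken, fileId, scope) =>
  new URLSearchParams({
    grant_type: TOKEN_EXCHANGE_GRANT,
    subject_token: parentToken,
    subject_token_type: ACCESS_TOKEN_TYPE,
    scope,
    resource: `https://api.example.com/files/${fileId}`,
  });

// Separate read and write requests, so each layer only gets what it needs.
const readRequest = buildDownscopeRequest('PARENT_TOKEN', '12345', 'file_read');
const writeRequest = buildDownscopeRequest('PARENT_TOKEN', '12345', 'file_write');
```

The read token could go to the machine learning system to fetch the file, while the write token stays with the function that stores the resulting metadata.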
25. 3 Sensitive Information Exposure
Data in the files: What information is being transmitted through the channels in
the files, and is it sensitive information?
Are channels secure: Are all connections between your systems, the serverless
framework, and the machine learning system secure?
How the ML system handles data: Does the machine learning system store any
data long-term, and how secure is that storage?
Logging sensitive information: Are you logging sensitive information during
general program flow unintentionally?
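For the logging question in particular, one defensive option is to pass objects through a redaction step before they are written to a log. The sketch below is an assumption-laden example rather than a complete solution: the `SENSITIVE_KEYS` list and the `redact` helper are hypothetical, matching is by key name only, and sensitive values embedded inside free-text strings would not be caught.

```javascript
// Hypothetical list of key names whose values should never reach the logs.
const SENSITIVE_KEYS = new Set(['token', 'access_token', 'password', 'pan', 'authorization']);

// Return a copy of `value` with sensitive fields replaced, at any depth.
const redact = (value) => {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key,
        SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner),
      ])
    );
  }
  return value;
};

// Usage: log the redacted copy, never the original event.
const event = { fileId: 'file-1', auth: { access_token: 'abc123' } };
console.log(JSON.stringify(redact(event)));
```

Because `redact` returns a copy, the original event remains intact for the code that legitimately needs the token.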
26. 3 Tokenisation Specification
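Data Request: Sensitive information request
Cloud Data API: Data hosting service API
Secure Data Vault: Secure vault hosting data files
Flow:
1. PAN (Data Request to Cloud Data API)
2. PAN (Cloud Data API to Secure Data Vault)
3. Token / Status (Secure Data Vault to Cloud Data API)
4. Token / Status (Cloud Data API to Data Request)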
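The tokenisation exchange in this diagram can be sketched with an in-memory vault: the sensitive value (the PAN) goes in, and only an opaque token and a status come back out. This is a toy model under assumptions: `createVault` and `createCloudDataApi` are hypothetical names, the digit check is a simple illustrative validation, and a real vault would persist data in hardened, access-controlled storage rather than a `Map`.

```javascript
const { randomBytes } = require('crypto');

// Hypothetical secure vault: keeps the PAN and hands back an opaque token.
const createVault = () => {
  const panByToken = new Map();
  return {
    tokenize: (pan) => {
      // Simple illustrative check: digits only, 12 to 19 characters long.
      if (!/^\d{12,19}$/.test(pan)) {
        return { status: 'rejected', token: null };
      }
      const token = `tok_${randomBytes(16).toString('hex')}`;
      panByToken.set(token, pan);
      return { status: 'ok', token };
    },
    // Only the vault can map a token back to the original PAN.
    detokenize: (token) => panByToken.get(token) || null,
  };
};

// Hypothetical cloud data API: forwards the PAN and relays token / status.
const createCloudDataApi = (vault) => ({
  submitPan: (pan) => vault.tokenize(pan),
});

// Usage: the requester only ever holds the token, not the PAN.
const vault = createVault();
const api = createCloudDataApi(vault);
const response = api.submitPan('4111111111111111');
```

Downstream systems, including the serverless middleware and the machine learning system, can then pass the token around without ever handling the underlying value.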
27. Wrapup Topics
Building Blocks: How are these systems built?
Best Practices: How do we architect the solution?
Security Considerations: How do we ensure data security?
28. Better Data with Machine Learning and Serverless
Slides: http://bit.ly/ato-bdml