Taming Excel Bottlenecks

AWS Lambda Excel Processing: Cut Costs 90% for Financial Data | BFSI Cloud Solutions

First published on Dec 2, 2024 in Linkedin.

Imagine this: Your AWS Lambda function keeps crashing while handling big Excel files, running into memory constraints and timeout limits time and time again.  

Ring a bell? We just went through the same scenario of performance lag and increased costs when handling large data in cloud environments.  

But each barrier is an opportunity to learn.  

This blog documents our whole journey of transformation – maturing from memory-intensive Excel processing challenges to quick and economical operations for large Excel file generation. We addressed GraphQL data fetching performance challenges and DynamoDB token management challenges at the same time, transforming problems into triumph.  

The Challenge: Big Data Processing Issues  

Our requirements were to have large Excel files via GraphQL API data fetching and S3 uploads.  Early usage of Apache POI’s XSSF library resulted in  

  • Memory allocation mistakes  
  • Periodic function timeouts  
  • Slow processing times 

The Solution Framework: Strategic Optimization Approach  

We redesigned our strategy entirely with methodical enhancements.  

This is what we did to tackle root causes: 

1. Enhanced DynamoDB Token Management Strategy  

Previous State: 

Session token retrieval for DynamoDB operations was around 20 seconds per query, leading to bottlenecks and workflow interruption. Token expiration and in-process failure impacted system reliability significantly.  

Existing Implementation:  

Our system automatically manages credentials and refreshes tokens before they expire. This ensures uninterrupted access to resources by AppSync and DynamoDB services, improving reliability and scalability while preserving AWS security best practices. Our process now runs smoothly without disruption or threat.  

Successful Outcomes:  

  • Processing duration for the data decreased from 1 hour 23 minutes to 6.41 minutes. 
  • Processing failures eliminated, overall reliability attained  
  • Monthly spending decreased from $18.26 to $1.65  
Method to refresh credentials
Method to refresh credentials

IMPROVEMENTS 

Improvement in time and cost
Improvement in time and cost

2. Excel-to-CSV Conversion: Adopting Streamlined Processing 

    Previous State:  

    Working with large data in Excel meant bringing entire workbooks into memory, resulting in memory overflow and system crashes. 100,000 rows would take 2-3 hours and 1.5GB of memory.  

    Current Implementation:  

    Data conversion into CSV format enables lightweight, incremental processing. Each processing run writes data out to temporary CSV files and appends to S3 master file.  

    Performance Improvements:  

    • Processing time reduced from hours to 8-10 minutes  
    • 988MB of memory saved up  
    • About $10 per month cost savings  

    Core Benefits:  

    • Low Memory Footprint: CSV format avoids expensive operations  
    • Optimized Appending: FileChannel offers bulk large file appending with no performance loss  
    • Continuous Processing: Data is processed sequentially, offering reliable operations   
    Adding csv file using FileChannel
    Adding csv file using FileChannel

    IMPROVEMENTS 

    Improvement in time, cost and memory
    Improvement in time, cost and memory

    3. Streaming Excel for Big Dataset Management

    Earlier State:  

    Apache POI XSSF Workbook implementation caused Lambda memory limit exceptions due to XSSF’s behavior of loading whole workbooks into memory, which caused excessive dataset processing failures. The process was unstable and time-consuming.  

    Current Implementation: 

    We utilized SXSSF (Streaming Excel Spreadsheet Format) in Excel conversion from CSV, leveraging effective row caching and file streaming capabilities.  

    Technical Benefits  

    • Effective File Operations: Data goes into files instead of memory, avoiding memory overhead  
    • Row Caching System: Caches in memory only a sliding window of rows  
    • Large Dataset Support: Continuously supports big Excel files without impacting performance  

    Measurable Outcomes:  

    • Processing time reduced from 2-3 hours to 3-5 minutes  
    • Used Memory decreased by 988MB  
    • Utilized 10GB of Ephemeral Lambda Storage  
    • Attained near-perfect reliability 
    • Monthly charges dropped from $2.97 down to $0.0011   
    Writing workbook to a file
    Writing workbook to a file

    IMPROVEMENTS 

    Improvement in time, cost and memory
    Improvement in time, cost and memory

    4. GraphQL Operations Utility Singleton Implementation  

    Earlier State:  

    Having multiple S3 and GraphQL classes per operation was memory-intensive and introduced redundant usage patterns.  

    Existing Implementation:  

    We used a thread-safe Singleton pattern for GraphQL operations to reduce memory usage, eliminate redundancy, and aggregate accesses for uniform operations. The design is easy to maintain, supports thread-safe operations, and improves performance, making it ideal for synchronized data processing.  

    Performance Benefits:  

    • Monthly memory reserves of 256MB 
    • Reducing operation time by 50%, saving an average of 5 minutes 
    • Cost savings of $0.0021918  
    Thread safe singleton instance
    Thread safe singleton instance

    IMPROVEMENTS 

    Excel processing improvement in time, cost and memory
    Improvement in time, cost and memory

    The Outcomes:  

    • Improved Performance, Reliability, and Cost-Effectiveness.  
    • Through these strategic improvements, we transformed an inefficient system into a high-performance one.  
    • Key Insights Performance tuning isn’t merely a matter of cost reduction—it makes systems smart rather than forceful.  

    In the case of handling vast amounts of data or designing scalable cloud offerings, deliberate performance optimization reaps enormous rewards.  

    Having the same issues or interested in applying to your business? We’d love to discuss and share ideas! 

    Related Blogs