Client: A software development organization seeking to create a robust, custom Rust-based data science library and optimize WASM performance for Excel file parsing.
Objective:
Develop foundational components for a new Rust-based data science library tailored to specific requirements not addressed by existing libraries (e.g., Linfa, SmartCore). Additionally, optimize Excel file parsing in WebAssembly (WASM) for use in a frontend system, ensuring fast and efficient processing of large Excel files.
Data Science Library Development
Scope:
- Initial Task: Implement logistic regression using gradient descent, complete with unit tests, documentation, and a basic GUI for user interaction.
- Broader Goals: Plan the development of core data science building blocks, such as random variables, probability distributions, and clustering techniques, with efficient and scalable Rust implementations.
Key Deliverables:
- A standalone logistic regression implementation without external dependencies.
- A proof of concept demonstrating the usability of Rust for scientific and data-intensive tasks, ready for future expansion.
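As an illustration of what a dependency-free logistic regression might look like, the following is a minimal sketch using only the Rust standard library. It is not the delivered implementation: the type name `LogisticRegression`, the method names, the full-batch update rule, and the fixed learning rate and epoch count are all assumptions chosen for brevity.

```rust
/// Logistic sigmoid, written to avoid overflow for large |z|.
fn sigmoid(z: f64) -> f64 {
    if z >= 0.0 {
        1.0 / (1.0 + (-z).exp())
    } else {
        let e = z.exp();
        e / (1.0 + e)
    }
}

/// Hypothetical binary logistic regression model: weights plus a bias term.
pub struct LogisticRegression {
    pub weights: Vec<f64>,
    pub bias: f64,
}

impl LogisticRegression {
    /// Fits the model with full-batch gradient descent on the mean log-loss.
    /// `xs` holds one feature vector per sample; `ys` holds labels 0.0 or 1.0.
    pub fn fit(xs: &[Vec<f64>], ys: &[f64], learning_rate: f64, epochs: usize) -> Self {
        assert_eq!(xs.len(), ys.len(), "one label per sample");
        assert!(!xs.is_empty(), "need at least one sample");
        let n_features = xs[0].len();
        let n = xs.len() as f64;
        let mut model = LogisticRegression {
            weights: vec![0.0; n_features],
            bias: 0.0,
        };

        for _ in 0..epochs {
            // Accumulate the gradient of the log-loss over all samples.
            let mut grad_w = vec![0.0; n_features];
            let mut grad_b = 0.0;
            for (x, &y) in xs.iter().zip(ys) {
                let error = model.predict_proba(x) - y;
                for (g, &xi) in grad_w.iter_mut().zip(x) {
                    *g += error * xi;
                }
                grad_b += error;
            }
            // Step against the averaged gradient.
            for (w, g) in model.weights.iter_mut().zip(&grad_w) {
                *w -= learning_rate * g / n;
            }
            model.bias -= learning_rate * grad_b / n;
        }
        model
    }

    /// Estimated probability that `x` belongs to the positive class.
    pub fn predict_proba(&self, x: &[f64]) -> f64 {
        let z = self.weights.iter().zip(x).map(|(w, xi)| w * xi).sum::<f64>() + self.bias;
        sigmoid(z)
    }

    /// Class prediction using a 0.5 probability threshold.
    pub fn predict(&self, x: &[f64]) -> bool {
        self.predict_proba(x) >= 0.5
    }
}
```

A caller would fit the model with something like `LogisticRegression::fit(&xs, &ys, 0.5, 5000)` and then call `predict` on new feature vectors. A production version would likely add convergence checks, regularization, and input validation.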
Challenges Addressed:
- Balancing algorithmic accuracy with computational efficiency.
- Applying Rust's strong static typing and performance characteristics to scientific computing pipelines.
Outcome:
- Delivered a working implementation of logistic regression, integrated into a WASM-compatible environment.
- Successfully demonstrated the suitability of Rust for advanced data science workflows.
Excel Parsing Optimization in WASM
Scope:
Enhance the speed and reliability of Excel file parsing in WASM for a frontend application. Address limitations in the current implementation, which exhibited significant performance bottlenecks in parsing large files.
Approach:
- Initial Investigation:
  - Profiled the existing Rust-based Excel parser, isolating bottlenecks in byte stream reading and XML parsing.
  - Benchmarked performance across native Rust, Firefox WASM, and Chrome WASM, which identified slower execution in Chrome's WASM runtime as a key challenge.
- Enhancements:
  - Implemented a more efficient parsing path that bypasses intermediate conversions and generates CSV-compatible output directly from Excel files.
  - Removed unnecessary type casting and redundant row creation, yielding a 2x speedup over the original implementation.
- Bug Fixes and Compliance:
  - Addressed edge cases, including date handling, escaping of special characters (e.g., quotes), and sparse data formats.
  - Aligned the CSV output with the RFC 4180 specification.
- Integration:
  - Integrated the optimized parser into the client's frontend system, ensuring seamless functionality with existing workflows.
  - Provided detailed pull requests and documentation for integration and testing.
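The escaping rules behind RFC 4180 alignment can be sketched as follows. This is an illustrative example rather than the project's code: the function names `write_csv_field` and `write_csv_record` are hypothetical, and the sketch assumes cell values are already available as strings. It writes directly into one output buffer, in the spirit of generating CSV without building intermediate row objects.

```rust
/// Appends one field to `out`, quoting it only when RFC 4180 requires it:
/// the field contains a comma, a double quote, CR, or LF.
fn write_csv_field(out: &mut String, field: &str) {
    let needs_quotes = field.contains(|c: char| matches!(c, ',' | '"' | '\r' | '\n'));
    if needs_quotes {
        out.push('"');
        for c in field.chars() {
            if c == '"' {
                // An embedded double quote is escaped by doubling it.
                out.push('"');
            }
            out.push(c);
        }
        out.push('"');
    } else {
        out.push_str(field);
    }
}

/// Appends a full record terminated by CRLF, the line break RFC 4180 specifies.
fn write_csv_record(out: &mut String, fields: &[&str]) {
    for (i, field) in fields.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        write_csv_field(out, field);
    }
    out.push_str("\r\n");
}
```

Reusing a single `String` for the whole sheet avoids one allocation per row, which is one plausible way to cut the kind of redundant row creation described above.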
Challenges Addressed:
- Chrome's WASM runtime, which executed the parser more slowly than Firefox's.
- Complexities of parsing sparse Excel data formats directly into CSV or Polars DataFrames.
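To illustrate the sparse-data problem: the XML inside an .xlsx file may omit empty cells, so a row can arrive as a handful of (cell reference, value) pairs rather than a full set of columns. The sketch below shows one way to expand such a row into a dense one before CSV output. The function names and the pair-based input shape are assumptions made for this example, not the parser's actual data model.

```rust
/// Converts the column letters of an A1-style reference to a zero-based index
/// ("A" -> 0, "Z" -> 25, "AA" -> 26). Returns None when the reference does not
/// start with ASCII letters.
fn column_index(cell_ref: &str) -> Option<usize> {
    let mut index = 0usize;
    let mut seen_letter = false;
    for b in cell_ref.bytes().take_while(|b| b.is_ascii_alphabetic()) {
        // Column letters behave like a base-26 number with digits 1..=26.
        index = index * 26 + (b.to_ascii_uppercase() - b'A') as usize + 1;
        seen_letter = true;
    }
    if seen_letter {
        Some(index - 1)
    } else {
        None
    }
}

/// Expands sparse (cell reference, value) pairs into a dense row of `width`
/// columns, leaving missing cells as empty strings. Cells whose column falls
/// outside `width` are ignored.
fn densify_row<'a>(cells: &[(&str, &'a str)], width: usize) -> Vec<&'a str> {
    let mut row = vec![""; width];
    for &(cell_ref, value) in cells {
        if let Some(col) = column_index(cell_ref) {
            if col < width {
                row[col] = value;
            }
        }
    }
    row
}
```

For example, a row containing only cells `A3` and `D3` in a four-column sheet becomes four fields with the two middle ones empty, which keeps every CSV record the same length.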
Outcome:
- Delivered an optimized parser that handles large Excel files in roughly half the original runtime.
- Implemented robust handling for edge cases, improving reliability and user experience.
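Date handling is a typical source of such edge cases because Excel stores dates as serial day numbers. The sketch below converts a serial number to a calendar date under stated assumptions: the workbook uses the 1900 date system (not the 1904 one), the time-of-day fraction is discarded, and the function name `excel_serial_to_ymd` is hypothetical. It accounts for the fictitious 29 February 1900 that Excel's serial numbering includes.

```rust
/// Converts days since 1970-01-01 to (year, month, day) in the proleptic
/// Gregorian calendar, using the well-known civil-from-days algorithm.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, 0 = March
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = if mp < 10 { mp + 3 } else { mp - 9 } as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Converts an Excel serial date (1900 date system, an assumption here) to
/// (year, month, day). Returns None for non-finite values, serials below 1,
/// and serial 60, which Excel treats as the nonexistent 1900-02-29.
fn excel_serial_to_ymd(serial: f64) -> Option<(i64, u32, u32)> {
    if !serial.is_finite() {
        return None;
    }
    let whole = serial.floor() as i64; // drop the time-of-day fraction
    if whole < 1 || whole == 60 {
        return None;
    }
    // Serials after the phantom leap day are offset by one extra day.
    let unix_days = if whole < 60 { whole - 25_568 } else { whole - 25_569 };
    Some(civil_from_days(unix_days))
}
```

Returning `None` for serial 60 is one possible policy; another would be to map it to 1900-02-28 or pass the raw number through, depending on what downstream consumers expect.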
Key Technologies and Tools:
- Programming: Rust, WASM, JavaScript.
- Libraries: Quick-XML, Polars, Calamine, RFC 4180 compliance tools.
- Testing & Profiling: Flamegraph for native profiling, custom timing scripts for WASM and JavaScript benchmarks.
Client Impact:
- Data Science Library:
  - Established a robust foundation for expanding Rust's ecosystem in data science applications.
  - Enabled future development of algorithms tailored to the client’s unique requirements.
- Excel Parser:
  - Improved Excel parsing speed by 2x while resolving critical bugs and edge cases.
  - Enhanced user experience by ensuring compatibility with large files and seamless integration into the frontend.
The client appreciated the team's analytical approach, comprehensive testing, and collaborative spirit, paving the way for future projects.