Author Bias
I am primarily a Java developer. This inherently introduces bias:
- I have deeper knowledge of Java frameworks and their optimal configurations
- I may have inadvertently optimized Java implementations better than others
- My code review of non-Java implementations may miss language-specific anti-patterns or performance pitfalls
- Framework choices in other languages were based on popularity rather than personal expertise
On the bright side, my bias occasionally helps catch bugs! When I saw Node.js suspiciously outperforming everything else, I thought "there's no way Node is THAT fast" and investigated. It turned out Claude had implemented the artificial delay as an empty loop, which V8 happily optimized away to nothing. So yes, my Java-flavored skepticism saved the benchmark's integrity. You're welcome, Node developers. (The bug is fixed now and Node is back to normal performance... most of the time. I still catch it being suspiciously fast in some results.)
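For illustration, here is the same class of pitfall sketched in Java. This is not the project's code (the real bug was in the Node.js implementation, and all names below are made up): a delay written as an empty loop has no observable side effects, so an optimizing JIT (V8, HotSpot, ...) may remove it entirely, while a real blocking sleep cannot be optimized away.

```java
// Hypothetical sketch of the "optimized-away delay" pitfall described above.
public final class DelayPitfall {

    // BROKEN: the loop body has no observable side effects, so an optimizing
    // JIT may treat it as dead code and remove it, turning the "delay" into a no-op.
    static void busyWaitDelay(long iterations) {
        for (long i = 0; i < iterations; i++) {
            // intentionally empty -> candidate for dead-code elimination
        }
    }

    // SAFER: an actual blocking sleep cannot be optimized away.
    static void sleepDelay(long millis) throws InterruptedException {
        Thread.sleep(millis);
    }
}
```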
Code Authorship
Different parts of this project have different levels of human involvement:
HAND-WRITTEN CODE (by me):
- Java shared domain module
- Quarkus implementations
- Spring implementations
- Spring Boot 3 implementations
- Spring Boot 4 implementations
- Core benchmark logic and delay simulation
AI-ASSISTED (generated by Claude, reviewed by me):
- Micronaut implementations
- Helidon SE implementations
- OpenLiberty implementations
- WildFly implementations
- TomEE implementations
- GraalVM native-image configurations (reflect-config.json and such) for frameworks that didn't work out of the box
- All build scripts (Python)
FULLY AI-GENERATED (generated by Claude; reviewed lightly at best, may contain errors):
- All Go implementations
- All Rust implementations
- All Python implementations
- All PHP implementations
- All Node.js implementations
- The entire web report UI
Web Report UI
THE WEB UI WAS FULLY VIBE-CODED WITH CLAUDE AND THE CODE WAS NEVER CHECKED.
Frontend development is not my main expertise. The entire web report interface (HTML, CSS, JavaScript, charting logic) was generated through AI conversation without manual code review. This was intentional: an experiment in "full vibe coding". It works for my purposes, but:
- May contain bugs or inefficiencies
- May not follow frontend best practices
- May have accessibility issues
- May break on certain browsers or screen sizes
Use the web UI as a visualization tool, not as a reference implementation.
UI Controls Guide
Filter buttons (Language, Framework, Pattern, Build Type):
- Left-click: Toggle individual items on/off
- Right-click on an item: Select ONLY that item (deselect all others)
- Right-click on an already-solo item: Select ALL items in that group
Special filters:
- "Merge": Combines multiple runs of the same configuration into one series
- "Best": Shows only the best-performing variant per framework (does NOT work in all views - it's buggy, you've been warned)
Chart controls:
- Grid: Display multiple charts in a grid layout on one screen
- Legend: Toggle the chart legend visibility
- Crop: Crops the Y-axis to exclude extreme outliers. This was specifically added because some tests failed catastrophically (timeouts, errors) and produced data points that made the rest of the chart unreadable. Use this to focus on the "normal" performance range.
Resolution slider - HUGE WARNING
The resolution slider controls data point sampling/aggregation.
High resolution may CRASH YOUR BROWSER if you have too many series selected. Only use high resolution on heavily filtered views.
Rule of thumb:
- Many series selected → Use LOW resolution
- Few series selected → Use HIGH resolution for accurate analysis
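For readers wondering what "resolution" means here: conceptually, the slider decides how many aggregated points per series end up in the chart. The sketch below is only an illustration of that idea (bucket averaging) written in Java; it is not the UI's actual JavaScript code, and the names are invented for the example.

```java
// Illustrative sketch of resolution-style downsampling: group raw samples into
// buckets and keep one averaged point per bucket. Higher resolution means
// smaller buckets, i.e. more points for the browser to render per series.
import java.util.ArrayList;
import java.util.List;

public final class Downsample {

    static List<Double> toResolution(List<Double> samples, int targetPoints) {
        int bucketSize = Math.max(1, samples.size() / targetPoints);
        List<Double> result = new ArrayList<>();
        for (int start = 0; start < samples.size(); start += bucketSize) {
            int end = Math.min(start + bucketSize, samples.size());
            double sum = 0;
            for (int i = start; i < end; i++) {
                sum += samples.get(i);
            }
            result.add(sum / (end - start)); // one averaged point per bucket
        }
        return result;
    }
}
```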
Non-Java Implementations
Important caveats about non-Java language implementations:
1. NO SHARED CODE
Unlike the Java apps, which share a domain module, each non-Java framework is a completely standalone project. This means:
- Delay simulation logic was duplicated (may have subtle differences)
- JSON parsing/serialization varies by framework
- Error handling is inconsistent across implementations
2. DEV SERVER CONCERNS
Some implementations may be running in development mode rather than production mode. I identified and fixed several cases, but others may remain:
- Flask/Django running without gunicorn (i.e., on the built-in development server)
- Node.js with debug flags
- PHP with development error reporting
If you notice an implementation performing unexpectedly poorly, this could be the cause.
3. LIMITED OPTIMIZATION
I intentionally kept the code simple and avoided framework-specific optimizations to maintain consistency. However, this means:
- Implementations may not represent the framework's full potential
- Production-grade code would likely perform differently
- Connection pooling, caching, and other optimizations were not used
Test Environment
Hardware:
- CPU: Intel Core i9-14900KS
- CPU Mode: POWER-SAVING MODE (because Intel's recent CPUs have stability issues at full power - yes, Intel sucks)
- This means performance numbers are artificially lower than what the hardware could achieve at full power
- WSL2 allocated memory: 120GB
- No memory limits were applied to any application
- Java thrives in environments with abundant memory (as you can see in the benchmarks) - results may differ significantly in memory-constrained environments like containers with low limits
Software Environment:
- Applications ran inside WSL2 (Windows Subsystem for Linux 2)
- k6 load testing ran from Windows host
- Tests used direct WSL2 IP address to minimize network overhead
- This setup was chosen to reduce interference between load generator and applications under test
Network Path:
Windows k6 → Hyper-V virtual switch → WSL2 VM → Application
This is NOT equivalent to:
- Native Linux performance
- Docker container performance
- Bare metal performance
- Cloud/VM performance
Compute Endpoint Limitations
The compute workload calculates square roots to a specified precision. There are known issues:
1. DOUBLE PRECISION LIMITATION
The implementation uses double-precision floating point, which has limited precision (~15-17 significant digits). This means:
- Very high precision targets may not be achievable
- The algorithm may terminate early or behave unexpectedly
- This could explain the dual grouping of data points visible in some benchmark charts
2. ALGORITHM QUALITY
The square root calculation algorithm was not optimized for numerical accuracy or performance. It's a simple iterative approach (sketched after this list) that may:
- Have varying iteration counts for similar inputs
- Produce inconsistent CPU load across implementations
- Not be the best representation of "compute-bound" workloads
3. CROSS-LANGUAGE INCONSISTENCY
Floating-point behavior varies by language and runtime:
- Java, Go, and Rust may produce different results for the same input
- Some languages may optimize the calculation differently
- JIT compilation may affect computation patterns over time
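To make the precision issue concrete, here is a minimal Java sketch of the kind of iterative square-root routine described above (Newton's method). It is NOT the exact code used by the compute endpoint; it only shows why a sufficiently small precision target collides with the ~15-17 significant digits of a double and forces early termination.

```java
// Minimal sketch (not the project's actual compute endpoint) of an iterative
// square root with a requested precision, illustrating the double-precision limit.
public final class SqrtSketch {

    // Newton's method; assumes value > 0.
    static double sqrt(double value, double targetPrecision) {
        double guess = value > 1 ? value / 2 : value;
        int iterations = 0;
        while (Math.abs(guess * guess - value) > targetPrecision) {
            double next = 0.5 * (guess + value / guess);
            if (next == guess) {
                // The update no longer changes the result: doubles cannot get
                // any closer, so an unreachable precision target forces an
                // early exit here instead of looping forever.
                break;
            }
            guess = next;
            iterations++;
        }
        System.out.printf("target=%g iterations=%d result=%.17f%n",
                targetPrecision, iterations, guess);
        return guess;
    }

    public static void main(String[] args) {
        sqrt(2.0, 1e-9);   // reachable target, converges normally
        sqrt(2.0, 1e-30);  // unreachable target, exits via the next == guess check
    }
}
```

Running the two calls above shows how an unreachable target changes the iteration count and the exit path, which is the kind of behavior that can produce distinct groupings in the charts.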
Benchmark Methodology
Workload Types:
- "fast": Minimal processing to test pure framework latency. Includes basic payload validation to prevent frameworks from over-optimizing it away.
- "slow": Fixed artificial delay (Thread.sleep or equivalent)
- "compute": CPU-bound square root calculation
- "sqrt_stable": Deterministic code path - the same predictable code runs every time, making it friendly to CPU branch prediction and JIT optimization
- "sqrt_unstable": Random branch selection - a random choice determines which code branch executes, but the overall computational complexity remains the same. This defeats branch prediction and may show different JIT behavior.
- "real": Attempts to mimic a realistic endpoint with both computation and wait time, without using any external resources (database, network calls) to avoid contaminating the benchmark with external factors.
Test Configuration:
Load tests (k6):
- Virtual Users (VUs): Ramped from 0 to target
- Ramp-up time: 5 minutes
- Sustained load: 1 minute at target VUs
- Warmup: JIT-compiled applications had a warmup period before measurement
Startup benchmarks:
- Each application was started 10 times
- 50 HTTP requests (curl) per start to measure latency
- Memory measured via /proc after startup (sketched below)
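For clarity on what "memory measured via /proc" means, here is a hedged Java sketch of the idea; the project's own measurement tooling may look different, and the class and method names are invented for the example. The approach is simply to read the VmRSS line from /proc/<pid>/status once the application is up.

```java
// Illustrative sketch: read the resident set size of a started application
// from /proc/<pid>/status.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class ProcMemory {

    // Returns the resident set size in kB for the given PID, or -1 if not found.
    static long rssKiloBytes(long pid) throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/" + pid + "/status"))) {
            if (line.startsWith("VmRSS:")) {
                // Line looks like: "VmRSS:    123456 kB"
                return Long.parseLong(line.replaceAll("[^0-9]", ""));
            }
        }
        return -1;
    }
}
```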
What These Benchmarks DO Measure:
- Relative performance differences between frameworks
- Throughput under sustained concurrent load
- Latency distribution patterns
- How frameworks handle thread/connection exhaustion
What These Benchmarks DO NOT Measure:
- Real-world application performance (no database, no external services)
- Cold start in serverless environments
- Memory efficiency for long-running applications
- Framework ergonomics or developer productivity
Recommendations
When interpreting results:
- Focus on PATTERNS, not absolute numbers
The test environment introduces overhead. Compare frameworks relative to each other, not against theoretical maximums.
- Consider the WORKLOAD type
Different workloads (IO-bound vs CPU-bound) favor different architectures. A framework's ranking may change significantly between workload types.
- Remember the BIAS
Java implementations received more attention and optimization. Non-Java results should be taken with additional skepticism.
- This is a LEARNING PROJECT
The primary goal was to understand and compare concurrency models, not to definitively rank frameworks for production use.
- DO YOUR OWN BENCHMARKS
For production decisions, benchmark with your actual workload, your actual hardware, and your actual deployment environment.
Attribution
- Project concept and Java implementations: Human (me)
- AI assistance: Claude (Anthropic)
- Load testing: k6 (Grafana Labs)
- Charting: Plotly.js
- No actual humans were harmed in the making of this benchmark. Probably.
License & Usage
This project is provided for educational and demonstration purposes. Feel free to use the results in presentations, but please:
- Link back to the original project
- Include a reference to this disclaimer
- Acknowledge the limitations described above