Productive High Performance Parallel Programming With Auto-Tuned Domain-Specific Embedded Languages