Self-Supervised And Cross-Modal Learning From Videos