Source: devclass.com
Databricks, the company behind open source project Apache Spark, has given its Runtime a good old polishing, buffing the version number up to 5.5.
The new Databricks Runtime can, amongst other things, use AWS Glue in place of Hive as its metastore, and the product’s Secrets API can now inject secrets into R notebooks as well as Python and Scala ones.
Version 5.5 also comes with a couple of preview features. One of them is Instance Pools, which lets users keep a set of virtual machines in reserve so that clusters can be spun up quickly when needed. While the VMs are idle, only cloud provider costs are incurred, and there are no costs at all if the pool is scaled down to zero instances, according to Databricks.
Those using the Databricks Runtime on AWS can give querying Delta Lake tables from Presto or Amazon Athena a go, and help improve the final version by leaving feedback. The feature works via manifest files, which the services can read instead of going through the directory listing to find a table's files.
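To illustrate why a manifest helps, here is a minimal Python sketch of the general idea, not of Delta Lake's actual implementation. It assumes a plain-text manifest with one file path per line. The file name `_manifest.txt`, the helper names, and the "stale" data file are all hypothetical. The point is that a directory can contain files that no longer belong to the current table, so a reader that trusts the listing sees too much, while a reader that trusts the manifest sees only the live files.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_manifest(table_dir: Path, live_files: List[str]) -> Path:
    """Write a plain-text manifest naming the files that make up the table."""
    manifest = table_dir / "_manifest.txt"  # hypothetical name, for illustration only
    lines = [str(table_dir / name) for name in live_files]
    manifest.write_text("\n".join(lines) + "\n")
    return manifest


def files_from_listing(table_dir: Path) -> List[str]:
    """Naive reader: treat every Parquet file found in the directory as live."""
    return sorted(str(p) for p in table_dir.glob("*.parquet"))


def files_from_manifest(manifest: Path) -> List[str]:
    """Manifest reader: trust only the paths listed in the manifest."""
    return sorted(line for line in manifest.read_text().splitlines() if line)


def demo() -> Tuple[List[str], List[str]]:
    """Build a toy table directory and return the file names each reader sees."""
    with tempfile.TemporaryDirectory() as tmp:
        table_dir = Path(tmp)
        # Two live data files plus one left over from an older table version.
        for name in ("part-0.parquet", "part-1.parquet", "part-2-stale.parquet"):
            (table_dir / name).touch()
        manifest = write_manifest(table_dir, ["part-0.parquet", "part-1.parquet"])
        listed = [Path(p).name for p in files_from_listing(table_dir)]
        from_manifest = [Path(p).name for p in files_from_manifest(manifest)]
        return listed, from_manifest
```

Calling `demo()` returns both views of the same directory: the listing-based view includes the stale file, the manifest-based view does not.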
A new version of the Databricks Filesystem (DBFS) FUSE (Filesystem in Userspace) client is available only by contacting support. The reworked offering is meant to improve performance on all DBFS locations, mounts included, after previous runtime versions already introduced high-performance FUSE storage at dbfs:/ml.
Alongside the regular release, there is also a new version of the Runtime for Machine Learning available. Databricks Runtime for ML 5.5 adds an MLflow 1.0 package and upgrades TensorFlow, PyTorch, and scikit-learn. The ML-specific runtime also received a HorovodRunner update that gives users a way of distributing their training within a single node, which is meant to make the use of multiple GPUs more efficient.
More adventurous Databricks customers can try previews of a function for recursively loading files from nested input directories and of the scalar iterator Pandas UDF type. The latter can speed up some models, since it lets a model be applied to multiple input batches without being initialised again for each one.
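The initialise-once pattern behind the scalar iterator type can be sketched in plain Python, without Spark. This is an illustration under stated assumptions rather than the Databricks API: batches are ordinary lists here, whereas a real Pandas UDF would receive pandas Series, and `load_model`, `predict_per_batch`, `predict_iter` and the `LOADS` counter are hypothetical names invented for the example.

```python
from typing import Dict, Iterable, Iterator, List

# Counts how often the (hypothetical) model is initialised.
LOADS: Dict[str, int] = {"count": 0}


def load_model():
    """Stand-in for an expensive model load; the 'model' just doubles its input."""
    LOADS["count"] += 1
    return lambda x: 2.0 * x


def predict_per_batch(batch: List[float]) -> List[float]:
    """Plain scalar style: the model is initialised again for every batch."""
    model = load_model()
    return [model(x) for x in batch]


def predict_iter(batches: Iterable[List[float]]) -> Iterator[List[float]]:
    """Iterator style: initialise the model once, then stream over all batches."""
    model = load_model()
    for batch in batches:
        yield [model(x) for x in batch]
```

Both functions produce the same predictions. With three input batches, the per-batch version calls `load_model` three times, while the iterator version calls it once, which is where the saving comes from when initialisation is costly.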
Looking forward, Databricks is planning to drop Python 2 support with the release of Runtime 6.0, which should happen later in 2019. However, there are plans to offer long-term support for the last 5.x release, so that a maintained version remains available to run Python 2 code a little longer if necessary. The step isn’t that surprising, given that Python 2 reaches its end of life at the start of 2020.