High-Performance, Scalable Array Storage with TensorStore



This article was published as a part of the Data Science Blogathon.


Introduction


Many contemporary computer science and machine learning applications use multidimensional datasets that span a single large coordinate system. Two examples are predicting the weather from wind measurements on a geographic grid, and making medical imaging predictions from multi-channel image intensity values of 2D or 3D scans. These datasets can be difficult to work with because users may receive and write data at irregular intervals and at varying scales, and they often want to run analyses on multiple workstations simultaneously. Under these conditions, even a single dataset may require petabytes of storage.

In light of this, Google AI has introduced TensorStore, an open-source Python and C++ software library designed to store and manipulate N-dimensional data.


TensorStore

As briefly discussed in the introduction, TensorStore is an open-source Python and C++ software library designed to store and manipulate N-dimensional data. It supports multiple storage systems, including local and network file systems, Google Cloud Storage, and more, and it provides a uniform API for reading and writing diverse array formats. The library offers read/writeback caching and transactions with strong atomicity, consistency, isolation, and durability (ACID) guarantees, while optimistic concurrency ensures safe access from multiple processes and machines. A minimal sketch of the core read/write workflow is shown below.
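The following snippet is a minimal sketch of that uniform API, using the library's public Python interface to create a Zarr array on the local file system. The path, shape, and dtype are arbitrary choices for illustration.

```python
import tensorstore as ts

# Create a Zarr-format array on the local file system; pointing the
# kvstore spec at Google Cloud Storage instead changes only this dict.
dataset = ts.open({
    'driver': 'zarr',
    'kvstore': {'driver': 'file', 'path': '/tmp/my_dataset/'},
}, create=True, dtype=ts.uint32, shape=[1000, 1000]).result()

# Writes and reads use the same NumPy-style indexing.
dataset[80:82, 99:102].write(42).result()
print(dataset[80:82, 99:102].read().result())
```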

Safe and Performant Scaling

Extensive processing power is required to process and analyze large numerical datasets. Typically, this is achieved by parallelizing work across many CPU or accelerator cores spread over multiple machines. A primary goal of TensorStore is to make it possible to process individual datasets in parallel in a way that is both safe (i.e., avoids corruption or inconsistencies arising from parallel access patterns) and performant (i.e., reads and writes to TensorStore are not a bottleneck during computation). In fact, in a test conducted within Google's data centers, nearly linear scaling of read and write performance was observed as the number of CPUs increased:

Figure 1: Read/Write performance for TensorStore dataset in Zarr format located on Google Cloud Storage (GCS)

This performance is achieved by implementing core operations in C++, making heavy use of multithreading for tasks such as encoding/decoding and network I/O, and chunking large datasets so that subsets of the full dataset can be read and written quickly. TensorStore also provides an asynchronous API that allows a read or write operation to continue in the background while the program performs other work, as well as customizable in-memory caching (which reduces interactions with slower storage systems for frequently accessed data).
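As a brief illustration of the asynchronous API, the sketch below (reusing the `dataset` handle opened in the earlier snippet) issues a read that proceeds in the background and only blocks when the result is actually needed:

```python
# Issue an asynchronous read; I/O proceeds in the background.
future = dataset[0:500, 0:500].read()

# ... other computation can run here while the read is in flight ...

# Block only at the point where the data is actually required.
block = future.result()
print(block.shape)  # (500, 500)
```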

Safety of parallel operations when multiple machines access the same dataset is achieved through optimistic concurrency, which maintains compatibility with diverse underlying storage layers (including cloud storage platforms such as GCS, as well as local file systems) without significantly affecting performance. TensorStore also provides strong ACID guarantees for all individual operations executing within a single runtime, and writes can be grouped into transactions, as sketched below.
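Since the library exposes transactions in its Python API, a minimal sketch (assuming the `dataset` handle from above) might group several writes so they commit together:

```python
# Group writes into a transaction; nothing becomes visible to other
# readers until the commit succeeds.
txn = ts.Transaction()
dataset.with_transaction(txn)[0:2, 0:2].write(1).result()
dataset.with_transaction(txn)[2:4, 2:4].write(2).result()
txn.commit_async().result()  # apply both writes atomically
```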

Additionally, the researchers have integrated TensorStore with parallel computing frameworks such as Apache Beam (example code) and Dask (example code), making distributed computing with TensorStore compatible with existing data processing workflows.

Use Case 1: Language Model

The emergence of ever more capable language models, such as PaLM, is an exciting advance in machine learning. These neural networks, with hundreds of billions of parameters, exhibit impressive natural language understanding and generation capabilities. They also strain available computational resources; for example, training a language model like PaLM requires thousands of TPUs working in parallel.

One challenge during training is efficiently reading and writing the model parameters. Although training is distributed across many machines, the parameters must be periodically saved to a single object (a “checkpoint”) on a persistent storage system without slowing down the training process. Individual training jobs should also be able to read just the specific set of parameters they are concerned with, avoiding the overhead of loading the entire set of model parameters (which can run to hundreds of gigabytes).

TensorStore has already been used to address these issues. It has been integrated with frameworks such as T5X and Pathways, and is used to manage checkpoints associated with large-scale (“multipod”) models trained with JAX (code example). The full set of parameters, which can occupy more than a terabyte of memory, is partitioned across hundreds of TPUs using model parallelism. TensorStore stores checkpoints in Zarr format, with a chunk structure chosen so that the partition belonging to each TPU can be read and written independently, in parallel. A hedged sketch of this pattern follows.
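The sketch below illustrates the idea under stated assumptions: a hypothetical layout in which each worker owns a contiguous row block of a parameter matrix, with chunk boundaries aligned to those blocks so every worker writes its shard without coordinating with the others. The local path, shapes, and worker count are all placeholders.

```python
import numpy as np
import tensorstore as ts

# Hypothetical shard layout: 4 workers each own a contiguous row block
# of a (1024, 256) parameter matrix.
worker_id, num_workers = 0, 4
full_shape = (1024, 256)
rows_per_worker = full_shape[0] // num_workers

ckpt = ts.open({
    'driver': 'zarr',
    # Local path for illustration; a real checkpoint would point the
    # kvstore at cloud storage instead.
    'kvstore': {'driver': 'file', 'path': '/tmp/checkpoint/params/'},
    # Align chunk boundaries with shard boundaries so each worker
    # touches a disjoint set of chunks.
    'metadata': {'chunks': [rows_per_worker, full_shape[1]]},
}, create=True, open=True, dtype=ts.float32, shape=full_shape).result()

# Each worker writes only its own shard, in parallel with the others.
start = worker_id * rows_per_worker
local_params = np.zeros((rows_per_worker, full_shape[1]), np.float32)
ckpt[start:start + rows_per_worker, :].write(local_params).result()
```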

Use Case 2: 3D Brain Mapping

Synapse-resolution connectomics aims to map the intricate networks of individual synapses in animal and human brains. This requires imaging the brain at extremely high (nanometer) resolution over fields of view of millimeters or more, producing petabyte-scale datasets. These datasets may eventually reach exabyte scale as scientists contemplate mapping an entire mouse or primate brain. Even a single brain sample can require millions of gigabytes, with a coordinate system (pixel space) spanning hundreds of thousands of pixels in each dimension, posing major storage, manipulation, and processing challenges.

The researchers used TensorStore to overcome the computational difficulties posed by large-scale connectomic datasets. With Google Cloud Storage as the underlying object store, TensorStore manages some of the largest and most widely used connectomic datasets. For example, it has been applied to the human cortex “h01” dataset, a three-dimensional nanometer-resolution image of brain tissue. The raw imaging data is 1.4 petabytes (approximately 500,000 × 350,000 × 5,000 pixels) and is aligned with additional content, such as 3D segmentations and annotations, stored in the same coordinate system. The raw data is kept in the “neuroglancer_precomputed” format, which is well suited to interactive web-based viewing and is easily manipulated from TensorStore; it is split into individual chunks of 128×128×16 pixels.
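As a hedged sketch of reading such a volume, the snippet below opens a neuroglancer_precomputed dataset read-only and fetches a small 3D cutout. The bucket and path are placeholders, not the actual h01 location.

```python
import tensorstore as ts

# Open a neuroglancer_precomputed volume read-only; only the chunks
# that intersect a requested region are ever fetched.
volume = ts.open({
    'driver': 'neuroglancer_precomputed',
    # Placeholder bucket/path, not the real h01 dataset location.
    'kvstore': {'driver': 'gcs', 'bucket': 'example-bucket',
                'path': 'brain-volume/'},
}, read=True).result()

# Fetch a small sub-volume from the (x, y, z, channel) array.
cutout = volume[10000:10512, 20000:20512, 3000:3016].read().result()
print(cutout.shape)
```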


Figure 2: A fly brain reconstruction whose underlying data is freely accessible and manipulable using TensorStore (source: Google AI)

Conclusion

In brief, in this article we learned about TensorStore, an open-source C++ and Python software library designed for storing and manipulating N-dimensional data, which:

- Enables reading and writing of multiple array formats, such as Zarr and N5, through a unified API.
- Natively supports multiple storage systems, including local and network file systems, Google Cloud Storage, HTTP servers, and in-memory storage.
- Supports read/writeback caching and transactions with strong atomicity, consistency, isolation, and durability (ACID) guarantees.
- Supports efficient and safe access from multiple processes and machines through optimistic concurrency.
- Provides an asynchronous API for high-throughput access to external storage even under high latency.
- Provides fully composable, advanced indexing operations and virtual views (sketched below).
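To make the last point concrete, here is a small hedged sketch of composable virtual views, reusing the `dataset` handle from the first snippet. Each step builds a lazy view; no data moves until the final read.

```python
# Each indexing step composes a lazy virtual view; no I/O happens yet.
view = dataset[100:300, 100:300]   # rectangular sub-region
strided = view[::2, ::2]           # downsample by striding the view
flipped = strided[::-1, :]         # reverse the first dimension

# A single read materializes the fully composed view.
data = flipped.read().result()
print(data.shape)  # (100, 100)
```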

The media shown in this article is not owned by Analytics Vidhya and is used at the sole discretion of the author.
