3 TensorFlow with SeaweedFS
Jack edited this page 2024-05-17 13:56:07 +01:00

Simplest Example

import tensorflow as tf
import os

os.environ["S3_ENDPOINT"] = "http://localhost:8333"

# ...

train_dataset = tf.data.TFRecordDataset(filenames=[
    "s3://bucketname/path/to/file1.tfrecord",
    "s3://bucketname/path/to/file2.tfrecord",
]).map(record_parser).batch(BATCH_SIZE)

# ...

model.fit(train_dataset, ...)

TensorFlow on SeaweedFS S3

TensorFlow already supports S3 (Old Link)

Here is an adaption of it with unnecessary content removed.

Configuration

When reading or writing data on S3 with your TensorFlow program, the behavior can be controlled by various environmental variables:

  • S3_ENDPOINT: The endpoint could be overridden explicitly with S3_ENDPOINT specified.

To read or write objects in a bucket that is not publicly accessible, AWS credentials must be provided through one of the following methods:

  • Set credentials in the AWS credentials profile file on the local system, located at: ~/.aws/credentials on Linux, macOS, or Unix, or C:\Users\USERNAME\.aws\credentials on Windows.
  • Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Example Setup

Using the above information, we can configure Tensorflow to communicate to an S3 endpoint by setting the following environment variables:

S3_ENDPOINT=http://localhost:8333
AWS_ACCESS_KEY_ID=XXXXX                 # Credentials if configured
AWS_SECRET_ACCESS_KEY=XXXXX

Usage

Once setup is completed, Tensorflow can interact with S3 in a variety of ways. Anywhere there is a Tensorflow IO function, an S3 URL can be used.

Smoke Test

To test your setup, stat a file:

from tensorflow.python.lib.io import file_io
print file_io.stat('s3://bucketname/path/')

You should see output similar to this:

<tensorflow.python.pywrap_tensorflow_internal.FileStatistics; proxy of <Swig Object of type 'tensorflow::FileStatistics *' at 0x10c2171b0> >

Reading Data

filenames = ["s3://bucketname/path/to/file1.tfrecord",
             "s3://bucketname/path/to/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)

Tensorflow Tools

Many Tensorflow tools, such as Tensorboard or model serving, can also take S3 URLS as arguments:

tensorboard --logdir s3://bucketname/path/to/model/
tensorflow_model_server --port=9000 --model_name=model --model_base_path=s3://bucketname/path/to/model/export/

This enables an end to end workflow using S3 for all data needs.