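When we want to turn a single-machine TensorFlow training program into a distributed one (multiple machines, multiple GPUs), there are generally two broad options: 1. a fully asynchronous gradient update strategy, typified by the parameter server architecture; 2. a synchronous gradient update strategy, typified by Baidu's ring all-reduce. This article covers the parameter server architecture first.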

The parameter server strategy:

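With the parameter server asynchronous update strategy, each GPU or CPU updates the shared weights as soon as it has finished computing its own gradients, without waiting for the gradients of the other GPUs or CPUs (in some setups you can configure how many gradients to wait for). It then syncs the updated weights and starts the next round of computation.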
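To make the update rule concrete, here is a minimal single-process sketch of it, fitting the same y = 0.1 * x + 0.3 line used in the example later in this article. The `ParameterServer` class, the `train_async` function, and the fixed "everyone pulls, then everyone pushes" schedule are illustrative assumptions, not TensorFlow APIs; in a real cluster the staleness comes from separate processes running at different speeds.

```python
import random


class ParameterServer:
    """Holds the shared parameters; workers pull a copy and push gradients."""

    def __init__(self, w, b, lr):
        self.w, self.b, self.lr = w, b, lr
        self.global_step = 0

    def pull(self):
        return self.w, self.b

    def push(self, grad_w, grad_b):
        # Applied immediately: no barrier, no waiting for other workers.
        self.w -= self.lr * grad_w
        self.b -= self.lr * grad_b
        self.global_step += 1


def gradients(w, b, xs, ys):
    """Gradient of the mean squared error for the model y = w * x + b."""
    n = len(xs)
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
    grad_b = 2.0 * sum(errs) / n
    return grad_w, grad_b


def train_async(num_workers=3, rounds=1000, seed=0):
    rng = random.Random(seed)
    ps = ParameterServer(w=rng.uniform(-1.0, 1.0), b=0.0, lr=0.1)
    for _ in range(rounds):
        # Every worker pulls before anyone pushes, so all pushes after the
        # first one in a round are computed from already-stale weights.
        snapshots = [ps.pull() for _ in range(num_workers)]
        for w, b in snapshots:
            xs = [rng.random() for _ in range(20)]
            ys = [0.1 * x + 0.3 for x in xs]
            ps.push(*gradients(w, b, xs, ys))
    return ps


ps = train_async()
print("steps: %d, weight: %f, bias: %f" % (ps.global_step, ps.w, ps.b))
```

The stale gradients still point roughly in the right direction, so on this small problem the parameters settle near 0.1 and 0.3 anyway; that tolerance for staleness is what the asynchronous strategy relies on.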

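Parameter server architecture
When TensorFlow first added distributed support, it used exactly this parameter server architecture. TensorFlow generally divides tasks into two kinds of job: parameter servers, abbreviated ps, which store the trainable parameters (tf.Variable); and ordinary tasks, called workers, which run the actual computation.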

TensorFlow offers two ways to implement a parameter server: building a parameter server cluster with the low-level API, and ParameterServerStrategy from tf.distribute.Strategy.

Building a parameter server cluster with the low-level API

Complete example, dist_tf.py (TensorFlow 1.x API):

import tensorflow as tf
import numpy as np

# Describe the cluster, which has two kinds of job: ps and worker.
# ps consists of 2 tasks and worker of 3 (a task is usually one machine or one scheduling unit).
ps_hosts = ["xx.xxx.xx.xxxx:oooo", "xx.xxx.xx.xxxx:oooo"]
worker_hosts = ["xx.xxx.xx.xxxx:oooo", "xx.xxx.xx.xxxx:oooo", "xx.xxx.xx.xxxx:oooo"]
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

tf.app.flags.DEFINE_string("job_name", "worker", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS

def main(_):
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()
    else:
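        # Based on the job name, Variable ops created inside this `with` block go to the ps tasks
        # and all other compute ops go to this worker task. The default placement is round-robin.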
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):

            x_data = tf.placeholder(tf.float32, [100])
            y_data = tf.placeholder(tf.float32, [100])

            W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
            b = tf.Variable(tf.zeros([1]))
            y = W * x_data + b
            loss = tf.reduce_mean(tf.square(y - y_data))

            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.GradientDescentOptimizer(0.1)
            train_op = optimizer.minimize(loss, global_step=global_step)

            # The StopAtStepHook handles stopping after running given steps.
            hooks = [tf.train.StopAtStepHook(last_step=1000000)]
            # The MonitoredTrainingSession takes care of session initialization,
            # restoring from a checkpoint, saving to a checkpoint, and closing when done
            # or an error occurs.
            with tf.train.MonitoredTrainingSession(master=server.target,
                                                   is_chief=(FLAGS.task_index == 0),
                                                   # The task with task_index 0 is the chief: it initializes variables, writes checkpoints, saves summaries and restores state.
                                                   checkpoint_dir="/tmp/tf_train_logs",
                                                   save_checkpoint_secs=None,
                                                   hooks=hooks) as mon_sess:
                while not mon_sess.should_stop():
                    # Run a training step asynchronously.
                    # See `tf.train.SyncReplicasOptimizer` for additional details on how to
                    # perform *synchronous* training.
                    # mon_sess.run handles AbortedError in case of preempted PS.
                    train_x = np.random.rand(100).astype(np.float32)
                    train_y = train_x * 0.1 + 0.3
                    _, step, loss_v, weight, bias = mon_sess.run([train_op, global_step, loss, W, b],
                                                                 feed_dict={x_data: train_x, y_data: train_y})
                    if step % 100 == 0:
                        print("step: %d, weight: %f, bias: %f, loss: %f" % (step, weight, bias, loss_v))
                print("Optimization finished.")


if __name__ == "__main__":
    tf.app.run()
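The comment on replica_device_setter in the listing above says the default placement is round-robin. As a rough illustration of what that means for the three variables in this script, here is a small sketch in plain Python. The `round_robin_placement` helper is hypothetical and only mimics the idea; it is not how TensorFlow is implemented internally.

```python
import itertools


def round_robin_placement(variable_names, num_ps_tasks):
    """Assign variables to ps tasks in creation order, cycling through the tasks."""
    ps_tasks = itertools.cycle(range(num_ps_tasks))
    return {name: "/job:ps/task:%d" % next(ps_tasks) for name in variable_names}


# Variables in the order dist_tf.py creates them, with the 2 ps tasks from the cluster.
placement = round_robin_placement(["W", "b", "global_step"], num_ps_tasks=2)
for name, device in placement.items():
    print(name, "->", device)
```

Round-robin ignores how large each variable is, which is why the optimization notes later in the article suggest a size-aware strategy for models with very uneven variables.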

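In this example each task has to be started on its own machine, five machines in total, so the script is run five times to create the five tasks: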

python dist_tf.py --job_name=ps --task_index=0
python dist_tf.py --job_name=ps --task_index=1
python dist_tf.py --job_name=worker --task_index=0
python dist_tf.py --job_name=worker --task_index=1
python dist_tf.py --job_name=worker --task_index=2
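Typing these commands by hand gets error-prone as the cluster grows. One possible way to keep them consistent with the cluster definition is to generate them from it, as in the sketch below. The `launch_commands` helper and the `example.com` host names are made up for illustration; how each command actually reaches its machine (ssh, a scheduler, and so on) is left out.

```python
def launch_commands(cluster, script="dist_tf.py"):
    """Return (host, command) pairs, one per task in the cluster description."""
    commands = []
    for job_name, hosts in cluster.items():
        for task_index, host in enumerate(hosts):
            command = "python %s --job_name=%s --task_index=%d" % (
                script, job_name, task_index)
            commands.append((host, command))
    return commands


# Hypothetical addresses: 2 ps tasks and 3 worker tasks, as in the example above.
cluster = {
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
}

for host, command in launch_commands(cluster):
    print("%s  ->  %s" % (host, command))
```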

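Drawbacks of building a parameter server cluster with the low-level API:

There are many concepts to learn, and the learning curve is steep.
Going from single-machine code to multi-machine code takes a lot of changes.
Different scripts have to be run on multiple machines, although a cluster manager such as k8s can take care of this.
The ratio of PS to workers is hard to choose. (I suggest an even number of ps tasks; in my experience a ps-to-worker ratio of 1:3 works.)
Training speed suffers noticeably, because communication is expensive.

Common optimizations for parameter server: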

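If the model has embedding variables with a large number of parameters, consider using embedding_lookup_sparse_with_distributed_aggregation in place of tf.nn.embedding_lookup_sparse. It performs the embedding aggregation on the PS that holds the variable, converts the result to a dense tensor, and sends that to the worker for the rest of the model's computation.
tf.train.replica_device_setter takes a ps_strategy argument that controls how variables are placed on the ps tasks; tf.contrib.training.GreedyLoadBalancingStrategy can replace the default round-robin. Its advantage is that it treats placement as a kind of online bin packing driven by each parameter's size in bytes: every weight and bias goes to the task that currently carries the least load, which gives better load balancing.
When a parameter is extremely large (an embedding table, for example), create the variable with a partitioning strategy: partitioner=tf.fixed_size_partitioner(ps_nums)
Optimize the input pipeline. Link: https://www.tensorflow.org/guide/performance/datasets
Use an anti-affinity placement policy for bandwidth: make sure the ps tasks are spread across different physical machines.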
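Two of the points above, size-aware placement and fixed-size partitioning, are easier to see with numbers. The sketch below uses plain Python and hypothetical helper names (`greedy_placement`, `fixed_size_shard_rows`) to imitate the ideas; the variable names and shapes are invented, and the code does not call TensorFlow.

```python
FLOAT32_BYTES = 4


def greedy_placement(variables, num_ps_tasks):
    """Place each (name, size_in_bytes) on the ps task with the least load so far."""
    loads = [0] * num_ps_tasks
    placement = {}
    for name, num_bytes in variables:
        task = min(range(num_ps_tasks), key=lambda t: loads[t])
        placement[name] = task
        loads[task] += num_bytes
    return placement, loads


def fixed_size_shard_rows(total_rows, num_shards):
    """Split total_rows into num_shards pieces whose sizes differ by at most one row."""
    base, extra = divmod(total_rows, num_shards)
    return [base + 1 if i < extra else base for i in range(num_shards)]


# Invented model: one big embedding table and a few small dense layers.
variables = [
    ("embedding", 1000000 * 64 * FLOAT32_BYTES),
    ("dense1/kernel", 64 * 256 * FLOAT32_BYTES),
    ("dense1/bias", 256 * FLOAT32_BYTES),
    ("dense2/kernel", 256 * 1 * FLOAT32_BYTES),
    ("dense2/bias", 1 * FLOAT32_BYTES),
]

placement, loads = greedy_placement(variables, num_ps_tasks=2)
print("placement:", placement)
print("bytes per ps task:", loads)

# Even with greedy placement one task still holds the whole embedding table,
# which is the case where splitting the table itself across ps tasks helps.
print("embedding rows per shard:", fixed_size_shard_rows(1000000, 2))
```

In this made-up model the greedy rule sends the embedding table to one ps task and all the small variables to the other, whereas round-robin would have stacked two of the dense variables on top of the embedding. Partitioning then addresses the remaining imbalance by cutting the table into shards.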

ParameterServerStrategy in Estimator

# https://stackoverflow.com/questions/55003279/parameter-server-strategy-with-estimatorstensorflow
import tensorflow as tf
import os
import json

NUM_WORKERS = 1
IP_ADDRS = ['localhost']
PORTS = [12345]

def model_fn(...):
    .....

def input_fn(...):
    .....

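The TF_CONFIG environment variable has to be set on every machine, with that machine's own task type and index. Note that this sample builds the ps and worker lists from the same address and port; in a real cluster every ps and worker task needs its own address.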

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ['%s:%d' % (IP_ADDRS[w], PORTS[w]) for w in range(NUM_WORKERS)],
        'ps': ['%s:%d' % (IP_ADDRS[w], PORTS[w]) for w in range(NUM_WORKERS)]
    },
    'task': {'type': 'worker', 'index': 0}
})

# Method for using ParameterServerStrategy
strategy = tf.distribute.experimental.ParameterServerStrategy()

config = tf.estimator.RunConfig(train_distribute=strategy)

classifier = tf.estimator.Estimator(
    model_fn=model_fn, model_dir='/tmp/multiworker', config=config)
tf.estimator.train_and_evaluate(
    classifier,
    train_spec=tf.estimator.TrainSpec(input_fn=input_fn),
    eval_spec=tf.estimator.EvalSpec(input_fn=input_fn)
)
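Since every machine needs the same 'cluster' section but a different 'task' section, it can help to build TF_CONFIG from one shared description. The following is a minimal sketch under that assumption; `make_tf_config` is a hypothetical helper and the host names are placeholders, with ps and worker tasks given distinct addresses.

```python
import json


def make_tf_config(worker_hosts, ps_hosts, task_type, task_index):
    """Build the TF_CONFIG JSON string for one task of the cluster."""
    cluster = {"worker": list(worker_hosts), "ps": list(ps_hosts)}
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index %d out of range for %r" % (task_index, task_type))
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Placeholder addresses for illustration only.
WORKER_HOSTS = ["worker0.example.com:12345", "worker1.example.com:12345"]
PS_HOSTS = ["ps0.example.com:12346"]

# On the machine that runs worker 1, the value would be assigned with
# os.environ['TF_CONFIG'] = ... before the strategy is created.
print(make_tf_config(WORKER_HOSTS, PS_HOSTS, "worker", 1))
```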

This article is reposted from Alex-zhai's Zhihu account.
Original link: https://zhuanlan.zhihu.com/p/69010949
