Environment Setup

Introduction

Tutorial & Docs

Quick Start

Deploy

Spark can be deployed standalone, or on top of a cluster resource manager such as YARN or Mesos.

  • [Spark downloads][1]

Standalone

After unpacking the Spark distribution, you can start a standalone master directly with the following command:

./sbin/start-master.sh

When this command runs, Spark starts a Jetty-based web server and prints the master URL to the console and to the log file, for example:

15/05/28 13:20:57 INFO Master: Starting Spark master at spark://localhost.localdomain:7077
15/05/28 13:21:07 INFO MasterWebUI: Started MasterWebUI at http://192.168.199.166:8080

The spark://HOST:PORT URL shown in the log can be used by worker nodes to connect to the master, or passed as the master argument when creating a SparkContext.

To start a worker process and register it with the master, run:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

Note that the IP here refers to the address resolved from the hostname the master is listening on, i.e. the one printed when the master was started.

![enter description here][2]

The commands above accept the following arguments:

| Argument | Meaning |
| --- | --- |
| -h HOST, --host HOST | Hostname to listen on |
| -i HOST, --ip HOST | Hostname to listen on (deprecated, use -h or --host) |
| -p PORT, --port PORT | Port for service to listen on (default: 7077 for master, random for worker) |
| --webui-port PORT | Port for web UI (default: 8080 for master, 8081 for worker) |
| -c CORES, --cores CORES | Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker |
| -m MEM, --memory MEM | Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker |
| -d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker |
| --properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf) |
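For example, a worker could be started with explicit resource limits (a hypothetical invocation; the master URL is the one printed in the log above, and the core/memory values are arbitrary):

# Hypothetical example: start a worker limited to 4 cores and 4 GB of memory
./bin/spark-class org.apache.spark.deploy.worker.Worker \
  --cores 4 \
  --memory 4G \
  --webui-port 8081 \
  spark://localhost.localdomain:7077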

Cluster Launch

  • sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
  • sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file (an example conf/slaves layout is sketched after this list).
  • sbin/start-all.sh - Starts both a master and a number of slaves as described above.
  • sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
  • sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
  • sbin/stop-all.sh - Stops both the master and the slaves as described above.
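The start-slaves.sh and stop-slaves.sh scripts read the worker hosts from conf/slaves, which simply lists one hostname per line. A minimal sketch, with placeholder hostnames:

# conf/slaves -- one worker hostname per line (hypothetical hosts)
worker-node-1
worker-node-2
worker-node-3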

Docker

Application Submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

The parameters are as follows:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any

Standalone

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

After the application has been submitted, it shows up in the master web UI:

![enter description here][3]

Program

Initializing Spark

The first step in a Spark program is to create a JavaSparkContext object, which tells Spark how to connect to a cluster. To construct a JavaSparkContext you first build a SparkConf object that contains basic information about your application.

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

In the code above, appName is the name of your application, and master is the URL of the Spark master; for a standalone cluster it takes the form spark://HOST:PORT.
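As a minimal sketch of the two common cases (the application name and the master host/port below are placeholders):

// Connect to a standalone master, using the spark://HOST:PORT URL
// printed when the master was started (placeholder values here).
SparkConf clusterConf = new SparkConf()
    .setAppName("MyApp")
    .setMaster("spark://localhost.localdomain:7077");

// Run locally for testing, using all available cores.
SparkConf localConf = new SparkConf()
    .setAppName("MyApp")
    .setMaster("local[*]");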

Resilient Distributed Datasets (RDDs)

Parallelized Collections

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
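Once the collection has been parallelized it can be operated on in parallel. As a minimal sketch reusing distData from above, its elements can be summed with reduce:

// reduce applies an associative two-argument function across the partitions;
// for the list (1, 2, 3, 4, 5) the result is 15.
int sum = distData.reduce((a, b) -> a + b);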

External Datasets
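A minimal sketch, assuming a text file whose path is a placeholder and must be readable from every node: text files can be loaded into an RDD of lines with textFile.

// Load a text file as an RDD with one String per line.
JavaRDD<String> lines = sc.textFile("data.txt");
// Example: total number of characters across all lines.
int totalLength = lines.map(s -> s.length()).reduce((a, b) -> a + b);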

RDD Operations

Note that because Spark evaluates RDD operations lazily and executes them in a distributed fashion, you cannot traverse the data synchronously the way you would in an ordinary Java program; traversal-style operations are instead expressed by passing functions to the RDD API, in a callback-like style.
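As a minimal sketch reusing distData from the parallelized-collections example: a filter transformation only describes the traversal, and the count action triggers the actual distributed computation.

// filter builds a lazy description of the work; count() is the action
// that actually runs the job and returns the number of even elements.
long evens = distData.filter(x -> x % 2 == 0).count();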
