Environment Setup
Introduction
Tutorial & Docs
Quick Start
Deploy
Spark can be deployed standalone, or on top of a resource scheduler such as YARN or Mesos.
- [Spark download][1]
Standalone
After unpacking the Spark distribution, you can start a standalone master directly with the following command:
./sbin/start-master.sh
After running this command, Spark starts the master (serving its web UI via an embedded Jetty server) and prints the master's addresses to the console and the log file, for example:
15/05/28 13:20:57 INFO Master: Starting Spark master at spark://localhost.localdomain:7077
15/05/28 13:21:07 INFO MasterWebUI: Started MasterWebUI at http://192.168.199.166:8080
The spark://HOST:PORT URL printed here is the address that worker nodes connect to, and it is also what you pass as the master URL when creating a SparkContext.
To start a worker process and register it with the master, run:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
Note that the IP here must be the address the master is actually listening on, i.e. the one derived from the hostname printed when the master was started.
![enter description here][2]
The following options can be passed when running the above commands:
Argument | Meaning
---|---
-h HOST, --host HOST | Hostname to listen on
-i HOST, --ip HOST | Hostname to listen on (deprecated, use -h or --host)
-p PORT, --port PORT | Port for service to listen on (default: 7077 for master, random for worker)
--webui-port PORT | Port for web UI (default: 8080 for master, 8081 for worker)
-c CORES, --cores CORES | Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
-m MEM, --memory MEM | Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker
-d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
--properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
Cluster Launch
- sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
- sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
- sbin/start-all.sh - Starts both a master and a number of slaves as described above.
- sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
- sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
- sbin/stop-all.sh - Stops both the master and the slaves as described above.
Docker
Application Submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
The options are:
- --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
- --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
- --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
- --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
- application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
- application-arguments: Arguments passed to the main method of your main class, if any
Standalone
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
After the application is submitted, it shows up in the master's web UI:
![enter description here][3]
Program
Initializing Spark
To write a Spark program you first need to create a JavaSparkContext object, which tells Spark how to connect to a cluster. To create a JavaSparkContext you first build a SparkConf object containing the basic information about your application.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// appName appears in the cluster UI; master is the cluster URL to connect to
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
In the code above, appName is the name of your application and master is the URL of the Spark cluster to connect to; for a standalone cluster it takes the form spark://HOST:PORT.
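For quick local testing it is also common to pass a local master URL instead of a cluster address; a minimal sketch, where the application name "MyApp" and the thread count are just placeholders:
// "local[2]" runs Spark locally with two worker threads, convenient for development and tests
SparkConf localConf = new SparkConf().setAppName("MyApp").setMaster("local[2]");
JavaSparkContext localSc = new JavaSparkContext(localConf);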
Resilient Distributed Datasets (RDDs)
Parallelized Collections
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
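The resulting distributed dataset can then be operated on in parallel; a minimal usage sketch, assuming Java 8 lambdas are available:
// reduce is an action: it runs the function across the cluster and returns the summed result
int sum = distData.reduce((a, b) -> a + b);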
External Datasets
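A minimal sketch of creating an RDD from external storage with sc.textFile; the path data.txt is just a placeholder, and HDFS or other Hadoop-supported URIs work as well:
// Each element of the resulting RDD is one line of the input file
JavaRDD<String> lines = sc.textFile("data.txt");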
RDD Operations
Note that because Spark distributes work across the cluster and evaluates operations lazily, you cannot iterate over an RDD synchronously the way you would over an ordinary Java collection; instead, traversal-style logic is expressed in a callback-like fashion, by passing functions to RDD transformations and actions.
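A minimal sketch of this style, reusing the lines RDD from the previous section:
// map is a lazy transformation: it only records the operation to perform
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
// reduce is an action: it triggers the distributed computation and returns the result
int totalLength = lineLengths.reduce((a, b) -> a + b);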