run Spark on SeaweedFS
Installation for Spark
Follow the instructions in the Spark documentation:
- https://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
- https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
Installation inheriting from Hadoop cluster configuration
Inheriting from an existing Hadoop cluster configuration is usually the easiest way.
To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the Hadoop configuration file core-site.xml, usually /etc/hadoop/conf.
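If that Hadoop configuration does not yet know about SeaweedFS, a minimal sketch of the two files involved could look like the following, assuming the SeaweedFS filer runs at localhost:8888 and the seaweedfs-hadoop2-client jar is already on the Hadoop classpath.
In $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=/etc/hadoop/conf
In /etc/hadoop/conf/core-site.xml:
<configuration>
  <!-- register the SeaweedFS Hadoop-compatible file system implementation -->
  <property>
    <name>fs.seaweedfs.impl</name>
    <value>seaweed.hdfs.SeaweedFileSystem</value>
  </property>
  <!-- make the SeaweedFS filer the default file system for Spark/Hadoop paths -->
  <property>
    <name>fs.defaultFS</name>
    <value>seaweedfs://localhost:8888</value>
  </property>
</configuration>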
Installation not inheriting from Hadoop cluster configuration
Copy the seaweedfs-hadoop2-client-3.80.jar to all executor machines.
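If the jar is not built locally, it can also be downloaded from a Maven repository; the coordinates below are an assumption (group com.github.chrislusf, version 3.80), so check the Hadoop Compatible File System page for the exact artifact matching your Hadoop major version:
# assumed Maven Central coordinates: com.github.chrislusf:seaweedfs-hadoop2-client:3.80
wget https://repo1.maven.org/maven2/com/github/chrislusf/seaweedfs-hadoop2-client/3.80/seaweedfs-hadoop2-client-3.80.jar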
Add the following to spark/conf/spark-defaults.conf on every node running Spark
spark.driver.extraClassPath=/path/to/seaweedfs-hadoop2-client-3.80.jar
spark.executor.extraClassPath=/path/to/seaweedfs-hadoop2-client-3.80.jar
Then pass the SeaweedFS file system settings at runtime:
./bin/spark-submit \
--name "My app" \
--master local[4] \
--conf spark.eventLog.enabled=false \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
--conf spark.hadoop.fs.defaultFS=seaweedfs://localhost:8888 \
myApp.jar
Example
- Change the spark-defaults.conf
spark.driver.extraClassPath=/Users/chris/go/src/github.com/seaweedfs/seaweedfs/other/java/hdfs2/target/seaweedfs-hadoop2-client-3.80.jar
spark.executor.extraClassPath=/Users/chris/go/src/github.com/seaweedfs/seaweedfs/other/java/hdfs2/target/seaweedfs-hadoop2-client-3.80.jar
spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem
- Create the Spark history folder
$ curl -X POST http://192.168.2.3:8888/spark2-history/
- Run a Spark job
$ bin/spark-submit --name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.jars.ivy=/tmp/.ivy \
--conf spark.eventLog.enabled=true \
--conf spark.hadoop.fs.defaultFS=seaweedfs://192.168.2.3:8888 \
--conf spark.eventLog.dir=seaweedfs://192.168.2.3:8888/spark2-history/ \
file:///usr/local/spark/examples/jars/spark-examples_2.12-3.0.0.jar
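To confirm that the event log was written into SeaweedFS, the history folder can be listed directly on the filer; the filer returns a JSON directory listing when the request asks for JSON:
$ curl -H "Accept: application/json" http://192.168.2.3:8888/spark2-history/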
A Full Example
Here is my local example switching everything to SeaweedFS.
- This is my local /usr/local/spark/conf/spark-defaults.conf:
spark.eventLog.enabled=true
spark.sql.hive.convertMetastoreOrc=true
spark.yarn.queue=default
spark.master=local
spark.history.ui.port=18081
spark.history.fs.cleaner.interval=7d
spark.sql.statistics.fallBackToHdfs=true
spark.yarn.historyServer.address=master:18081
spark.sql.orc.filterPushdown=true
spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.cleaner.maxAge=90d
spark.sql.orc.impl=native
spark.history.fs.cleaner.enabled=true
spark.history.fs.logDirectory=seaweedfs://localhost:8888/spark2-history/
spark.eventLog.dir=seaweedfs://localhost:8888/spark2-history/
spark.driver.extraClassPath=/Users/chris/go/src/github.com/seaweedfs/seaweedfs/other/java/hdfs2/target/seaweedfs-hadoop2-client-3.80.jar
spark.executor.extraClassPath=/Users/chris/go/src/github.com/seaweedfs/seaweedfs/other/java/hdfs2/target/seaweedfs-hadoop2-client-3.80.jar
spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem
spark.hadoop.fs.defaultFS=seaweedfs://localhost:8888
- Create the Spark history folder
$ curl -X POST http://192.168.2.4:8888/spark2-history/
- Run a Spark shell
$ bin/spark-shell
20/10/18 14:11:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/10/18 14:12:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.2.4:4040
Spark context available as 'sc' (master = local, app id = local-1603055539864).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.textFile("/buckets/large/ttt.txt").count
res0: Long = 9374
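Because fs.defaultFS points at the filer, writes also go through SeaweedFS. A quick sanity check from the same shell, using an arbitrary output path chosen only for illustration; reading the written files back should count the 100 elements:
scala> sc.parallelize(1 to 100).saveAsTextFile("/tmp/spark-write-test")
scala> sc.textFile("/tmp/spark-write-test").count
res1: Long = 100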