mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2025-01-22 17:40:49 +08:00
27c74a7e66
change replication_type to ReplicaPlacement, hopefully cleaner code works for 9 possible ReplicaPlacement xyz x : number of copies on other data centers y : number of copies on other racks z : number of copies on current rack x y z each can be 0,1,2 Minor: weed server "-mdir" default to "-dir" if empty
110 lines
4.2 KiB
Plaintext
110 lines
4.2 KiB
Plaintext
1. each file can choose the replication factor
|
|
2. replication granularity is in volume level
|
|
3. if not enough spaces, we can automatically decrease some volume's the replication factor, especially for cold data
|
|
4. plan to support migrating data to cheaper storage
|
|
5. plan to manual volume placement, access-based volume placement, auction based volume placement
|
|
|
|
When a new volume server is started, it reports
|
|
1. how many volumes it can hold
|
|
2. current list of existing volumes and each volume's replication type
|
|
Each volume server remembers:
|
|
1. current volume ids
|
|
2. replica locations are read from the master
|
|
|
|
The master assign volume ids based on
|
|
1. replication factor
|
|
data center, rack
|
|
2. concurrent write support
|
|
On master, stores the replication configuration
|
|
{
|
|
replication:{
|
|
{type:"00", min_volume_count:3, weight:10},
|
|
{type:"01", min_volume_count:2, weight:20},
|
|
{type:"10", min_volume_count:2, weight:20},
|
|
{type:"11", min_volume_count:3, weight:30},
|
|
{type:"20", min_volume_count:2, weight:20}
|
|
},
|
|
port:9333,
|
|
}
|
|
Or manually via command line
|
|
1. add volume with specified replication factor
|
|
2. add volume with specified volume id
|
|
|
|
|
|
If duplicated volume ids are reported from different volume servers,
|
|
the master determines the replication factor of the volume,
|
|
if less than the replication factor, the volume is in readonly mode
|
|
if more than the replication factor, the volume will purge the smallest/oldest volume
|
|
if equal, the volume will function as usual
|
|
|
|
|
|
Use cases:
|
|
on volume server
|
|
1. weed volume -mserver="xx.xx.xx.xx:9333" -publicUrl="good.com:8080" -dir="/tmp" -volumes=50
|
|
on weed master
|
|
1. weed master -port=9333
|
|
generate a default json configuration file if doesn't exist
|
|
|
|
Bootstrap
|
|
1. at the very beginning, the system has no volumes at all.
|
|
When data node starts:
|
|
1. each data node send to master its existing volumes and max volume blocks
|
|
2. master remembers the topology/data_center/rack/data_node/volumes
|
|
for each replication level, stores
|
|
volume id ~ data node
|
|
writable volume ids
|
|
If any "assign" request comes in
|
|
1. find a writable volume with the right replicationLevel
|
|
2. if not found, grow the volumes with the right replication level
|
|
3. return a writable volume to the user
|
|
|
|
|
|
for data node:
|
|
0. detect existing volumes DONE
|
|
1. onStartUp, and periodically, send existing volumes and maxVolumeCount store.Join(), DONE
|
|
2. accept command to grow a volume( id + replication level) DONE
|
|
/admin/assign_volume?volume=some_id&replicationType=01
|
|
3. accept setting volumeLocationList DONE
|
|
/admin/set_volume_locations_list?volumeLocationsList=[{Vid:xxx,Locations:[loc1,loc2,loc3]}]
|
|
4. for each write, pass the write to the next location, (Step 2)
|
|
POST method should accept an index, like ttl, get decremented every hop
|
|
for master:
|
|
1. accept data node's report of existing volumes and maxVolumeCount ALREADY EXISTS /dir/join
|
|
2. periodically refresh for active data nodes, and adjust writable volumes
|
|
3. send command to grow a volume(id + replication level) DONE
|
|
5. accept lookup for volume locations ALREADY EXISTS /dir/lookup
|
|
6. read topology/datacenter/rack layout
|
|
|
|
An algorithm to allocate volumes evenly, but may be inefficient if free volumes are plenty:
|
|
input: replication=xyz
|
|
algorithm:
|
|
ret_dcs = []
|
|
foreach dc that has y+z+1 volumes{
|
|
ret_racks = []
|
|
foreach rack with z+1 volumes{
|
|
ret = select z+1 servers with 1 volume
|
|
if ret.size()==z+1 {
|
|
ret_racks.append(ret)
|
|
}
|
|
}
|
|
randomly pick one rack from ret_racks
|
|
ret += select y racks with 1 volume each
|
|
if ret.size()==y+z+1{
|
|
ret_dcs.append(ret)
|
|
}
|
|
}
|
|
randomly pick one dc from ret_dcs
|
|
ret += select x data centers with 1 volume each
|
|
|
|
A simple replica placement algorithm, but may fail when free volume slots are not plenty:
|
|
ret := []volumes
|
|
dc = randomly pick 1 data center with y+z+1 volumes
|
|
rack = randomly pick 1 rack with z+1 volumes
|
|
ret = ret.append(randomly pick z+1 volumes)
|
|
ret = ret.append(randomly pick y racks with 1 volume)
|
|
ret = ret.append(randomly pick x data centers with 1 volume)
|
|
|
|
|
|
TODO:
|
|
1. replicate content to the other server if the replication type needs replicas
|