btrfs deduplication is really something... it effectively gives a raw img disk image elastic storage usage...

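维格纳的朋友 · 2019-11-01 17:34

I use bees. When I went over there to ask whether nocow defeats deduplication and whether virtual disk images can be deduplicated, the developer also told me: "I can dedupe your qcow2 too. Never mind virtual disks: subvolumes, snapshots, compression, all of it still gets deduplicated. Encryption is the only shaky case. I'd still recommend img for better performance, since otherwise you get copy-on-write twice over, and btrfs has snapshots of its own."
Once he said that, I remembered that my QEMU VM boots via UEFI and apparently can't be snapshotted, and I had forgotten that btrfs snapshots were an option. Because the default snapshots taken during upgrades once nearly filled my disk, I had always assumed that snapshotting a virtual disk with this thing would fill the disk in minutes.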

sudo btrfs filesystem usage /home
Overall:
    Device size:                   3.49TiB
    Device allocated:              1.22TiB
    Device unallocated:            2.27TiB
    Device missing:                  0.00B
    Used:                        972.49GiB
    Free (estimated):              2.54TiB      (min: 2.54TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:1.21TiB, Used:969.53GiB
   /dev/nvme0n1p1          1.21TiB

Metadata,single: Size:5.01GiB, Used:2.95GiB
   /dev/nvme0n1p1          5.01GiB

System,single: Size:4.00MiB, Used:160.00KiB
   /dev/nvme0n1p1          4.00MiB

Unallocated:
   /dev/nvme0n1p1          2.27TiB

qemu-img convert -f qcow2 -O raw ~/.vm/Win10.qcow2 ~/.vm/Win10.img

rm -rf ~/.vm/Win10.qcow2
   
sudo btrfs filesystem usage /home
Overall:
    Device size:                   3.49TiB
    Device allocated:              1.07TiB
    Device unallocated:            2.42TiB
    Device missing:                  0.00B
    Used:                        971.70GiB
    Free (estimated):              2.54TiB      (min: 2.54TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 80.00KiB)

Data,single: Size:1.06TiB, Used:968.77GiB
   /dev/nvme0n1p1          1.06TiB

Metadata,single: Size:5.01GiB, Used:2.94GiB
   /dev/nvme0n1p1          5.01GiB

System,single: Size:4.00MiB, Used:160.00KiB
   /dev/nvme0n1p1          4.00MiB

Unallocated:
   /dev/nvme0n1p1          2.42TiB

In my test, img actually seems to get a slightly higher dedupe hit rate!
Does img completely beat qcow2?
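To make the before/after comparison concrete, here is a minimal Python sketch that subtracts the figures from the two `btrfs filesystem usage /home` listings above. The numbers are copied from those listings; the dictionary keys and the `delta` helper are names made up for this illustration. Since the listings report TiB values to only two decimals, the allocated figure is approximate.

```python
# Figures copied from the two `btrfs filesystem usage /home` listings.
TIB_IN_GIB = 1024

before = {
    "allocated_gib": 1.22 * TIB_IN_GIB,  # Device allocated: 1.22TiB
    "used_gib": 972.49,                  # Used: 972.49GiB
    "data_used_gib": 969.53,             # Data,single Used: 969.53GiB
}
after = {
    "allocated_gib": 1.07 * TIB_IN_GIB,  # Device allocated: 1.07TiB
    "used_gib": 971.70,                  # Used: 971.70GiB
    "data_used_gib": 968.77,             # Data,single Used: 968.77GiB
}


def delta(before, after):
    """Return how much each figure shrank (positive means space was freed)."""
    return {key: round(before[key] - after[key], 2) for key in before}


freed = delta(before, after)
for key, value in freed.items():
    print(f"{key}: {value} GiB smaller after the conversion")
```

The allocated figure counts space reserved in chunks, while the used figure counts bytes actually stored, so the two shrink by very different amounts here.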

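Come to think of it, the cold data from my mechanical hard drive is itself about 1T after compression.
Then add the virtual disk images and a pile of OpenWrt source trees.
Getting it deduplicated down to this level
is genuinely something.
Still, better not to use it on an HDD: deduplication apparently causes heavy fragmentation, and on an HDD that would probably kill performance on the spot...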

Hm? A 3.49T NVMe drive?
Never mind that, it's just an ordinary second-hand imported enterprise drive, the kind you see everywhere~

Copy-on-write allows all writes to be continuous–since every write relocates data, all writes can be relocated to contiguous areas, even if the writes themselves are randomly ordered.

If a file is written randomly, then later sequential reads will be slower. The sequential logical order of the reads will not match the random physical order of data on the disk.

If a file is written continuously, then later sequential reads will be faster. This is how the btrfs ‘defrag’ feature works–it simply copies fragmented data into a contiguous area in order, so that future sequential reads are in logical and physical order at the same time.

If a file is read continuously, then performance will be proportional to the size of each non-consecutive fragment. There will be one seek per fragment, plus another seek to read a new metadata block on every ~100th fragment. On SSDs the seeks are replaced with IO transaction overheads, which are almost as expensive as physical head movements on SATA SSD devices.
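The seek accounting in the paragraph above can be sketched as a rough estimate in Python. The one-seek-per-fragment rule and the ~100 fragments per metadata block come from the quoted text; the function names, the 8 ms seek time, and the 150 MB/s transfer rate are hypothetical values chosen only for illustration.

```python
import math


def estimated_seeks(fragments, fragments_per_metadata_block=100):
    """One seek per data fragment, plus one metadata seek per ~100 fragments."""
    return fragments + math.ceil(fragments / fragments_per_metadata_block)


def sequential_read_seconds(file_bytes, fragments, seek_seconds, bytes_per_second):
    """Seek overhead plus raw transfer time for one sequential pass over a file."""
    return estimated_seeks(fragments) * seek_seconds + file_bytes / bytes_per_second


# Hypothetical device: 8 ms per seek, 150 MB/s transfer, reading a 1 GiB file.
for fragments in (1, 1_000, 100_000):
    seconds = sequential_read_seconds(2**30, fragments, 0.008, 150e6)
    print(f"{fragments:>7} fragments: about {seconds:.1f} s")
```

The model ignores caching and readahead, so it only shows how the seek term grows with fragment count, not what a real device would measure.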

If a file is read randomly (e.g. hash table lookups), then performance will be close to the worst-case rate all the time.

Data extent fragmentation makes random read performance a little worse, but metadata pages usually fit in RAM cache, so once the cache is hot, only the data block reads contribute significantly to IO load. If you have a really large file and the metadata pages don’t fit in RAM cache, then you’ll take a metadata page read hit for every data block, and on a fast SSD that can be an 80% performance loss (one 16K metadata page random read for each 4K data page random read). Slow disks only have a 50% performance loss (the seek time dominates, so the 16K random read cost is equivalent to the 4K one).
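The 80% and 50% figures in the paragraph above follow from simple ratios. The sketch below is one way to reproduce them in Python, assuming that read cost scales with bytes transferred on a fast SSD and with the number of reads on a slow, seek-bound disk; the function names are made up for this illustration.

```python
def loss_when_bandwidth_bound(data_kib=4, metadata_kib=16):
    """Fast SSD: cost scales with bytes read, so the extra 16K metadata page
    read that accompanies each 4K data page is pure overhead."""
    return metadata_kib / (data_kib + metadata_kib)


def loss_when_seek_bound(reads_per_data_block=2):
    """Slow disk: every random read costs about one seek regardless of size,
    so two reads per data block halve the useful throughput."""
    return 1 - 1 / reads_per_data_block


print(f"fast SSD loss:  {loss_when_bandwidth_bound():.0%}")
print(f"slow disk loss: {loss_when_seek_bound():.0%}")
```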

Double the RAM cache costs and/or performance losses from fragmentation if csums are used (each read needs another O(log(n)) metadata page lookup for the csum).

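A detailed explanation of btrfs copy-on-write from the bees developer~
...
Could someone more capable than me read through this and explain it?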

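Once the kernel patches are merged, the bees developer plans to add the ability to discard data blocks before they are ever written to disk, to curb SSD write amplification and reduce write load, and, if possible, to have deduplication and defragmentation run at the same time.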

Schr0dingerCat · 2019-11-02 00:24

Support, even though I don't understand it.

benren · 2019-11-02 06:22

Is this deduplication real-time?

维格纳的朋友 · 2019-11-02 07:30

Yes, as you can see.
After I deleted the qcow2 image, it released the storage space.
And the dedupe hit rate seems to be higher with an img image.

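维格纳的朋友 · 2019-11-02 07:41

For now, though, after you write two duplicate files it deletes one of the duplicates and links it to the other, which adds one extra write operation. That doesn't necessarily worsen SSD write amplification, since some files may be deduplicated while still in cache.
Hold off on HDDs for now, because removing duplicate data blocks itself breaks file contiguity, and defragmenting files then seems to trigger deduplication again... a loop.
The initial full-disk dedupe is just a one-time removal of all duplicate data followed by linking, so that part is fine.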

runapp · 2019-11-03 22:58

Have you tested I/O performance inside the VM?

sazhufa · 2020-11-23 13:51

Could you recommend a link for those second-hand imported drives?

维格纳的朋友 · 2020-11-23 13:57

I'm also looking for a big batch of 2T ones, but unfortunately I haven't found any...

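lilydjwg · 2020-11-29 09:54

qcow2 does support snapshots. Snapshotting the virtual disk file broke my Win10 VM: after restoring the earlier file, booting always ends up at "Choose your keyboard layout", then what seems to be the repair options, and finally it tells you it can't be repaired.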

维格纳的朋友 · 2020-11-30 11:30

So, rounding off, it's as good as useless...

hillwood · 2020-12-02 02:54

Using second-hand imported SSDs isn't a great idea, is it?

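维格纳的朋友 · 2020-12-03 08:25

In fact, enterprise SSDs beat the entire consumer lineup on both reliability and performance, especially now that the whole Western Digital family has the cold-data problem...
That holds even for drives pulled from servers that have already been in use for a while.
I still get 1200M/s reading cold data from a year ago.
And enterprise drives barely slow down across the whole disk.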

Houge_Langley · 2021-09-09 07:27

Hello, what do you mean by real-time?

Are you distinguishing between in-band and out-of-band, or something else?

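benren · 2021-09-09 07:41

I don't quite follow. Deduplication generally works in one of two ways. One is online: when a program writes data, the block's hash is compared, and if an identical block is already on disk, the write is simply mapped to that block. The other is offline: you run a separate offline script that checks the contents of every file block in the system and, if there are identical ones, removes one of them.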
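The two approaches described above can be sketched with a toy in-memory model in Python. This is only an illustration of the idea, not how bees, duperemove, or btrfs implement it: the `Volume` class, its method names, and the 4096-byte chunk size are all assumptions, and matching on the hash alone skips the byte-for-byte comparison a real tool would want before sharing blocks.

```python
import hashlib


def block_hash(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class Volume:
    """Toy volume: physical blocks plus per-file lists of block ids."""

    def __init__(self):
        self.blocks = {}   # block id -> contents
        self.files = {}    # file name -> list of block ids
        self._next_id = 0

    def _store(self, block):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = block
        return block_id

    def write_plain(self, name, chunks):
        """No dedupe at write time: every chunk gets its own physical block."""
        self.files[name] = [self._store(chunk) for chunk in chunks]

    def write_online(self, name, chunks, index):
        """Online dedupe: look up each chunk's hash before storing it."""
        ids = []
        for chunk in chunks:
            digest = block_hash(chunk)
            if digest not in index:
                index[digest] = self._store(chunk)
            ids.append(index[digest])
        self.files[name] = ids

    def dedupe_offline(self):
        """Offline dedupe: scan stored blocks afterwards and merge identical ones."""
        first_seen = {}  # hash -> id of the block that survives
        remap = {}       # every block id -> surviving block id
        for block_id, block in sorted(self.blocks.items()):
            remap[block_id] = first_seen.setdefault(block_hash(block), block_id)
        for name, ids in self.files.items():
            self.files[name] = [remap[i] for i in ids]
        for block_id, survivor in remap.items():
            if block_id != survivor:
                del self.blocks[block_id]


chunks_a = [b"A" * 4096, b"B" * 4096]
chunks_b = [b"A" * 4096, b"C" * 4096]  # shares its first chunk with chunks_a

online, index = Volume(), {}
online.write_online("a.img", chunks_a, index)
online.write_online("b.img", chunks_b, index)

offline = Volume()
offline.write_plain("a.img", chunks_a)
offline.write_plain("b.img", chunks_b)
blocks_before_scan = len(offline.blocks)
offline.dedupe_offline()

print(len(online.blocks), blocks_before_scan, len(offline.blocks))
```

Both volumes end up with the same number of physical blocks; the difference is that the offline one writes the duplicate first and reclaims it later, while the online one has to keep the hash index available at write time.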

Actually, I was wrong to say real-time. I should have asked whether it is online or offline.

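Houge_Langley · 2021-09-09 07:53

Hello, thank you very much for your reply.

Here's the thing: I happen to be preparing material on this topic, and I wasn't sure whether the online/offline question you raised was the same thing as my misunderstanding.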

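维格纳的朋友 · 2021-09-09 08:31

In-band dedupe needs a lot of memory, as with zfs.
This one works differently: you write a file to disk, it hashes it, and then it removes the duplicate blocks.
I was just amazed that dedupe actually works on disk images.
In any case, it does increase write amplification.

Also, I picked up a pm1733 3.84T ex-mining drive.
Only 2300 yuan, what more could you ask for.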

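维格纳的朋友 · 2021-09-09 08:37

I no longer have any idea what the btrfs developers are up to these days; at the very least, in-band dedupe keeps getting put off, and built-in bcache support doesn't exist at all.
I'm planning to use zfs for my NAS. I'm on Manjaro anyway, so why not.
If you're interested in dedupe, go straight to the bees author; he knows it inside out.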

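Houge_Langley · 2021-09-09 08:42

Thanks. For now, dedupe is just a short introductory segment, about 10 minutes of explanation plus hands-on. If I go too deep it may scare people off, so as far as the concepts and applications of dedupe go, I'm basically set. I won't use bees for the demo, since its configuration and usage are inconvenient to show live; I'll use duperemove instead and just give the audience a first impression. They can dig into the related tools themselves afterwards. The tools currently listed are bees, duperemove, and dduper; the btrfs wiki does not recommend the others.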

Houge_Langley · 2021-09-09 08:43

Great. As for zfs, you can watch my videos on Bilibili or discuss it in the gentoo-zh Telegram group.

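维格纳的朋友 · 2021-09-09 09:11

bcache+btrfs performs astonishingly badly; 4k reads got no speedup at all.
Then I tried zfs, and 4k read performance improved noticeably.
That makes it feasible to mount an SMB share on Windows and use it as a game drive.
Also, installing zfs on Manjaro is really easy.
You just have to accept slower major kernel version updates.
And using an SSD as a read cache actually takes effect.
But I don't know whether putting a cache in front of an HDD can make the HDD last a bit longer.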