跳转至

2024

计算panel bed编码区长度

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#!/bin/bash
#@File    :   run.sh
#@Time    :   2023/11/21 09:32:17
#@Author  :   biolxy
#@Version :   1.0
#@Contact :   biolxy@aliyun.com
#@Desc    :   None

inputbed=$1

# export PATH=/data/biogonco/lixy/bedtools/bedtools2/bin:$PATH  # v2.30.1 不行
export PATH=/data/bioinfo_project/bioinfo_miniconda3/bin:$PATH  # v2.30.0 不行
# /data/biogonco/lixy/bedtools/bedtools2/bin/bedtools

bedtools merge -i ${inputbed} > merge.bed

bedtools summary -i merge.bed -g /mnt/nas_001/project/oncodemo/genome_and_annotation/hg19/gatk/b37/chrom.sizes > merge.bed.summary  # 统计的是 bed 区间的长度


# 去 gencode 下载 gencode.v41.annotation.gff3
# wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh37_mapping/gencode.v44lift37.annotation.gff3.gz
# /mnt/nas_001/project/oncodemo/genome_and_annotation/genome/gffread-0.12.7.Linux_x86_64/gffread gencode.v44lift37.annotation.gff3 -T -o gencode.v44lift37.annotation.bed
# awk '$3 == "CDS"' gencode.v44lift37.annotation.bed | sed 's/^chr//g' > coding_CDS.bed

# coding_CDS.bed 为编码区bed

bedtools intersect -a merge.bed -b /mnt/nas_001/project/oncodemo/genome_and_annotation/genome/coding_CDS.bed | sort -k1,1 -k2,2n |  bedtools merge -i - > panel.cds.bed

# panel.cds.bed 为 pane 的 编码区 bed
bedtools summary -i panel.cds.bed -g /mnt/nas_001/project/oncodemo/genome_and_annotation/hg19/gatk/b37/chrom.sizes > panel.cds.summary 

注意

要注意bedtools的版本问题,我测试的 v2.30.0可以,v2.30.1 不行

准确度和灵敏度和特异性

定义

Precision 准确度:指找出来的突变中的真阳性有多少

Recall 召回率:指总突变集合有多少可以被找出来

公式

准确度(Precision)

准确度是指在所有被预测为阳性(即突变)的样本中,实际为阳性(真阳性)的比例。换句话说,它衡量的是预测结果中真阳性的比例。其计算公式为:

\(\text{Precision} = \frac{TP}{TP + FP}\)

其中,TP是真阳性的数量,FP是假阳性的数量。

召回率(Recall)或 灵敏度/敏感度(Sensitivity)

召回率是指在所有实际为阳性的样本中,被正确预测为阳性(即真阳性)的比例。它衡量的是测试方法捕捉到所有实际阳性样本的能力。其计算公式为:

\(\text{Recall} = \frac{TP}{TP + FN}\)

其中,TP是真阳性的数量,FN是假阴性的数量。

这两个指标通常用于评估分类模型的性能,特别是在医学和生物学研究中,它们帮助研究者理解一个测试或分析方法在识别阳性样本方面的准确性和完整性。在体细胞变异检测的背景下,这两个指标对于评估变异检测算法的性能至关重要。

特异性(Specificity)

特异性是指在实际没有某种疾病的人群中,诊断测试能够正确排除非患者的能力。它衡量了测试对非患者的"特异性",即测试能够准确地排除非患者的能力。 特异性的计算公式为:

特异性 = 真阴性(True Negative)/(真阴性 + 假阳性(False Positive))

\(\text{Specificity}= \frac{TN}{FP + TN}\)

注意: 这里不使用 特异性 这一概念,原因是在NGS数据calling中,难以确定真阴性位点的数量,特异性无法计算。

F1_score

F1_score,是评估分类模型性能的一种指标,特别是在二分类问题中。它结合了精确度(Precision)和召回率(Recall)两个指标来提供一个单一的评分,以衡量模型的整体性能。 F1_score 是精确度和召回率的调和平均数,公式为:

\(F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)

F1_score 的值范围是0到1,1表示完美的精确度和召回率,0表示模型性能很差。

在处理不平衡的数据集时,F1_score 特别有用,因为它同时考虑了精确度和召回率,而不是仅仅关注其中一个。这使得 F1_score 成为一个在多种情况下都相对平衡的性能度量标准。

使用MkDocs+GithubPage 搭建个人博客

使用 MkDocs 的原因是 语雀没有会员的话,即使是(设置过的)互联网访问的文库,新创建的文档他人也无法打开链接,想来还是重新回转到博客了

1. 参考

  • ref1: https://wcowin.work/blog/Mkdocs/mkdocs1.html

  • ref2: https://www.cnblogs.com/chinjinyu/p/17610438.html

  • 配置文件
  • https://zhuanlan.zhihu.com/p/688322635

2. 具体操作

具体操作参考 ref1中的步骤,有几个坑需要注意: 1. 我之前使用过GithuaPage,我没有删除旧项目,而是保留了.git 文件夹,清除了其文件,此时,我的主要分支是 master, 因此 ci.yml 中需要对于的修改:

YAML
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
name: ci 
on: # 在什么时候触发工作流
  push: # 在从本地master分支被push到GitHub仓库时
    branches:
      - master
      - main
  pull_request: # 在master分支合并别人提的pr时
    branches:
      - master
      - main
jobs: # 工作流的具体内容
  deploy:
    runs-on: ubuntu-latest # 创建一个新的云端虚拟机 使用最新Ubuntu系统
    steps:
      - uses: actions/checkout@v4 # 先checkout到main分支
      - uses: actions/setup-python@v4 # 再安装Python3和相关环境
        with:
          python-version: 3.x
      - run: pip install mkdocs-material # 使用pip包管理工具安装mkdocs-material
      - run: mkdocs gh-deploy --force # 使用mkdocs-material部署gh-pages分支
  1. 使用 blog 插件,配置 authors 信息时,自23-08之后,.authors.yml 文件路径修改了,默认路径是在 docs 文件夹下:
YAML
1
2
3
4
5
authors:
  biolxy:
    name: biolxy    # Author name
    description: 生物信息工程师 # Author description
    avatar: https://gravatar.com/avatar/d63d3425433de59d2c3e901977a581c7?size=256&cache=1715411947733

1. 原文参考

● 利用贝叶斯定理判断SNV 一个位点所有可能的genotype如下图: ["AA", "CC", "TT", "GG", "AC", "AT", "AG", "CT", "CG", "TG"]

利用贝叶斯定理判断SNV的方法就是根据测序数据(D)计算每一种genotype(G)的概率,即Pr(G|D)。

也就是计算所有十种可能性的概率, 那么,概率最高的就是该位点的genotype。

根据贝叶斯定理计算Pr(G|D)公式如下:

\[Pr(G | D)=Pr(G)*Pr(D | G)/Pr(D)\]

我们的目的是比较不同genotype的概率,因此计算Pr(G)*Pr(D|G)/Pr(D)即可。

\[Pr(G|D)\propto Pr(G)*Pr(D|G)\]

现在有两个问题:

  1. 如何计算Pr(D|G)?
  2. 如何计算P(G)?

1.如何计算Pr(D|G)? 下面通过一个例子来说明计算Pr(D|G)的方法。data如下:

假设G为A1A2,如下:

Pr(D|G)的计算公式如下:

\[Pr(\mathcal{D}|G)=\prod_{i=1..n}Pr(b_{i}|\{A_{1},A_{2}\})=\prod_{i=1..n}\frac{Pr(b_{i}|A_{1})+Pr(b_{i}|A_{2})}{2}\]
\[b_{j}\in\{\mathrm{A},\mathrm{G},\mathrm{A}\}\]

P(b|A)的概率如下, 其中e为对应碱基的质量值:

\[\text{当 }b_i=A_j\quad Pr(b_i|A_j)=1-e_i\]
\[\text{当 }b_i\neq A_j\quad Pr(b_i|A_j)=e_i/3\]

当Genotype=AG时,套用上面的公式,首先分别计算不同碱基对应的结果:

\[Pr(b_1=\mathtt A|\mathtt A\mathtt G)=\quad\frac{1}{2}\left((1-10^{-2})+\frac{10^{-2}}{3}\right)=\quad0.49667\]
\[Pr(b_2=\mathtt G|\mathtt A\mathtt G)=\quad\frac{1}{2}\left(\frac{10^{-1}}{3}+(1-10^{-1})\right)=\quad0.466667\]
\[Pr(b_{3}=\mathtt A|\mathtt A\mathtt G)=\quad\frac{1}{2}\left((1-10^{-5})+\frac{10^{-5}}{3}\right)=\quad0.499997\]

最终的结果如下:

\[Pr({D}|{AG})=Pr(b_1={A}|{AG}) * Pr(b_2={G}|{AG}) * Pr(b_3={A}|{AG}) = 0.49667 * 0.466667 * 0.499997=0.115888\]
  1. 如何计算P(G)?

计算P(G), 就是计算10种可能的genotype的概率, 即Pr(AA),Pr(TT),Pr(CC)等等。

假定参考碱基为G,如果杂合突变率为0.001, 纯合突变率为0.0005。genotype的概率如下: 有了Pr(D|AG)和Pr(G)的概率, Pr(AG|D)就可以计算出来了, 如下:

\[Pr({AG}|{D})= Pr({D}|{AG})Pr({AG})=0.115888*0.001=0.000116\]

同理可以计算出其他genotype的概率, 如下:

因此Pr(AG)是10种genotype的概率最高的, 该位点的genotype为AG。

实际SNV calling的方法,可能还有很多细节, 或者其他方法,我这里分享的东西也可能存在不少错误,如果你发现了, 欢迎留言。

2. 代码实现

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
import math
import itertools as it

# 计算基因型的先验概率
def calc_prior_prob(ref_base, g1, g2):
    # 假设已知的参数
    # REF_BASE = 'G'  # 参考碱基
    HET_RATE = 0.001  # 杂合突变率
    HOM_RATE = 0.0005  # 纯合突变率
    if g1 == g2 == ref_base:
        return 1.0 - HET_RATE - HOM_RATE
    elif g1 != g2:
        return HET_RATE
    else:
        return HOM_RATE

# 根据PHRED质量值计算错误率
def get_error_rate(qual):
    # 将PHRED分数转换为概率
    error_rate = 10 ** (-qual / 10.0)
    return error_rate

# 计算读取数据的条件概率
def calc_likelihood2(data, g1, g2, base_qualities):
    likelihood = 1.0
    for base, qual in zip(data, base_qualities):
        p_err = get_error_rate(qual)
        p_A1 = get_genotype_rate(base, p_err, g1)
        p_A2 = get_genotype_rate(base, p_err, g2)
        likelihood *= (p_A1 + p_A2) / 2
    return likelihood


def get_genotype_rate(base, p_err, genotype):
    if base == genotype:
        return 1.0 - p_err
    else:
        return p_err / 3

# 计算后验概率
def calc_posterior_prob(ref_base, data, base_qualities):
    probs = {}
    genotype_list = get_genotype(ref_base, data)
    for g1,g2 in genotype_list:
        prior = calc_prior_prob(ref_base, g1, g2)
        likelihood = calc_likelihood2(data, g1, g2, base_qualities)
        probs[(g1, g2)] = prior * likelihood
    return probs


def get_genotype(ref_base, data):
    res = []
    tmp = list(set(data))
    if ref_base not in tmp:
        tmp.append(ref_base)
    for e in it.combinations_with_replacement(tmp, 2):
        res.append(e)
    return res


# 使用示例
REF_BASE = 'G'  # 参考碱基
data = ['A', 'G', 'A']
base_qualities = [20, 10, 50]
posterior_probs = calc_posterior_prob(REF_BASE, data, base_qualities)

# 输出每种基因型的后验概率
for genotype, prob in sorted(posterior_probs.items(), key=lambda x: x[1], reverse=True):
    print(f"Genotype {genotype}: {prob:.6f}")

fq2fa 尝试使用 fast_zlib

起因

看到适用于绝大部分临床NGS数据分析的底层高度性能优化方案想测试一下效果

1. 查看 fq2fa.c 文件

cat fq2fa.c

C
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
#include <stdio.h>
#include <zlib.h>
#include "klib/kseq.h"


KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{

        gzFile fp;
        gzFile fo;
        if (argc < 2 ){
            return -1;
        }
        if ( argc == 3 ){
            fo = gzopen (argv[2], "wb");
        }

        kseq_t *seq;
        int l;
        if (argc == 1){
            fprintf(stderr, "Usage: %s <in.fasta|in.fasta.gz>\n", argv[0]);
            return 1;
        }

        fp = gzopen(argv[1], "r");
        seq = kseq_init(fp); // 分配内存给seq
        while( (l = kseq_read(seq)) >= 0){ //读取数据到seq中
            gzprintf(fo, "%s", seq->name.s);
            gzprintf(fo, "%s", seq->seq.s);
        }

        kseq_destroy(seq); //释放内存
        gzclose(fp);
        if (argc == 3) gzclose(fo);
        return 0;


}

2. 系统安装zlib

Bash
1
yum install -y zlib1g-dev zlib zlib-devel
Text Only
1
apt-get install -y zlib1g zlib1g.dev zlib

3. 下载fast_zlib, 下载zlib

  • 下载 klib
    Text Only
    1
    git clone https://github.com/attractivechaos/klib.git
    
    移动到 fq2fa 文件夹
Bash
1
git clone https://github.com/gildor2/fast_zlib.git
  • http://www.zlib.net/
  • 下载 http://www.zlib.net/zlib-1.2.12.tar.gz
    Bash
    1
    2
    $ sha256sum zlib-1.2.12.tar.gz 
    91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9  zlib-1.2.12.tar.gz
    

4. 修改zlib代码

  1. 复制 fast_zlib/Sources/match.hzlib-1.2.12
  2. cd zlib-1.2.12
  3. mv deflate.c deflate.old.c
  4. vim deflate.c
  5. 写入:
C
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#define ASMV
#include "deflate.old.c"

#undef local
#define local

#include "match.h"

void match_init()
{
}

5. 编译安装 zlib-1.2.12

Bash
1
2
./configure --prefix=/home/lixy/Clion/fast_zlib_test/zlib-1.2.12/build --shared --static
make && make install

6. 编译链接 fq2fa.c

下载klib

Text Only
1
git clone git@github.com:attractivechaos/klib.git
Text Only
1
2
gcc -o fq2fa_zlib fq2fa.c -lz -Lzlib 
gcc -o fq2fa_fast_zlib fq2fa.c -I/home/lixy/Clion/fast_zlib_test/zlib-1.2.12/build/include -L/home/lixy/Clion/fast_zlib_test/zlib-1.2.12/build/lib -lz

查看 MD5:

Text Only
1
2
3
$  md5sum fq2fa_zlib fq2fa_fast_zlib 
4c03dc0377470f6a589e1bb4a9ffb7b0  fq2fa_zlib
e237f08440a7db5821aa902c5a8cfc1a  fq2fa_fast_zlib

7. 测试 zlib 和 fast_zlib 版本各自的速度

生成测试文件, in.fq.gz 太小,体现不出速度差异

Bash
1
2
3
for i in $(seq 1 4000);do cat in.fq.gz >> test.fq.gz ;done
$ zcat in.fq.gz |wc -l 
4000000

fq2fa_zlib:

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /bin/time -v ./fq2fa_zlib test.fq.gz test_zlib.fa.gz  
    Command being timed: "./fq2fa_zlib test.fq.gz test_zlib.fa.gz"
    User time (seconds): 35.82
    System time (seconds): 0.17
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.29
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 980
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 289
    Voluntary context switches: 14
    Involuntary context switches: 813
    Swaps: 0
    File system inputs: 205752
    File system outputs: 93184
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

fq2fa_fast_zlib:

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /bin/time -v ./fq2fa_fast_zlib test.fq.gz test_fast_zlib.fa.gz  
    Command being timed: "./fq2fa_fast_zlib test.fq.gz test_fast_zlib.fa.gz"
    User time (seconds): 34.85
    System time (seconds): 0.11
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:35.31
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 980
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 290
    Voluntary context switches: 5
    Involuntary context switches: 730
    Swaps: 0
    File system inputs: 8
    File system outputs: 93184
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

查看输出结果一致性:

Bash
1
2
3
$ md5sum test_zlib.fa.gz test_fast_zlib.fa.gz  
5bd3de8cf81c78962aa7100da6ab2719  test_zlib.fa.gz
5bd3de8cf81c78962aa7100da6ab2719  test_fast_zlib.fa.gz

8. 疑问?

  • fast_zlib 对 zlib的优化是否成功? 如果成功了,为什么两个版本的程序速度没有差异

9. 听从大佬建议,使用静态库

或者 直接用

Bash
1
2
3
4
5
gcc -o fq2fa_fast_zlib fq2fa.c /home/lixy/Clion/fast_zlib_test/zlib-1.2.12/build/lib/libz.a

$ md5sum fq2fa_zlib fq2fa_fast_zlib               
4c03dc0377470f6a589e1bb4a9ffb7b0  fq2fa_zlib
262db896e101b93ca1f2b0b7b6ee8ddd  fq2fa_fast_zlib
Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /bin/time -v ./fq2fa_fast_zlib test.fq.gz test_fast_zlib.fa.gz                          
    Command being timed: "./fq2fa_fast_zlib test.fq.gz test_fast_zlib.fa.gz"
    User time (seconds): 24.68
    System time (seconds): 0.12
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.99
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1012
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 292
    Voluntary context switches: 6
    Involuntary context switches: 535
    Swaps: 0
    File system inputs: 0
    File system outputs: 93240
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

可以看到,速度明显快了⅓

10. 发现异常

  • 上述测试是在Centos 7.9, 2 CPUs, 4G MEM 环境下测试
  • 切换至 Ubuntu 18.04, 36 CPUs, 128G MEM / Ubuntu 20.04, 32 CPUs, 128G MEM后,发现 优化后的速度还不如不优化
Bash
1
2
for i in $(seq 1 10);do printf "test.fq.gz ";done
cat test.fq.gz test.fq.gz test.fq.gz test.fq.gz test.fq.gz test.fq.gz test.fq.gz test.fq.gz test.fq.gz test.fq.gz > aa.fq.gz 
Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /usr/bin/time -v ./fq2fa_fast_zlib aa.fq.gz aa_fast_zlib.fa.gz 
    Command being timed: "./fq2fa_fast_zlib aa.fq.gz aa_fast_zlib.fa.gz"
    User time (seconds): 19.61
    System time (seconds): 0.02
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.63
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1888
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 152
    Voluntary context switches: 1
    Involuntary context switches: 25
    Swaps: 0
    File system inputs: 0
    File system outputs: 93128
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /usr/bin/time -v ./fq2fa_zlib aa.fq.gz aa_zlib.fa.gz     
    Command being timed: "./fq2fa_zlib aa.fq.gz aa_zlib.fa.gz"
    User time (seconds): 18.20
    System time (seconds): 0.03
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.24
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2040
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 160
    Voluntary context switches: 1
    Involuntary context switches: 23
    Swaps: 0
    File system inputs: 0
    File system outputs: 93064
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
  • 推测:
  • Ubuntu 系统上fast_zliblongest_match 函数的实现 与 CentOS 系统上的不同,所以相同的修改效果不显著,甚至是无用的
  • 更换的两个Ubuntu系统均为多核心CPU, 高内存服务器,使得 fast_zliblongest_match 函数的优化仅能在 较少CPU和较少内存是体现优势

11. 解决 10 提出的异常

11.1 重新编译,保持单一变量

  • 上诉两个程序的编译命令不同,不符合单一变量原则
  • 解决:
  • 解压 zlib-1.2.12.tar.gz
  • cp -r zlib-1.2.12 fast_zlib-1.2.12
  • 不修改 zlib 代码,直接编译 zlib-1.2.12
  • 按 4 5 步骤,修改zlib代码,编译 fast_zlib-1.2.12
  • 分别编译链接 fq2fc
    • gcc -o fq2fa_fast_zlib fq2fa.c /home/lixy/myproject/fast_zlib_test/fast_zlib-1.2.12/build/lib/libz.a -I/home/lixy/myproject/fast_zlib_test/fast_zlib-1.2.12/build/include
    • gcc -o fq2fa_zlib fq2fa.c /home/lixy/myproject/fast_zlib_test/zlib-1.2.12/build/lib/libz.a -I/home/lixy/myproject/fast_zlib_test/zlib-1.2.12/build/include

测试两个文件的速度:

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /usr/bin/time -v ./fq2fa_zlib aa.fq.gz aa_zlib.fa.gz
    Command being timed: "./fq2fa_zlib aa.fq.gz aa_zlib.fa.gz"
    User time (seconds): 28.85
    System time (seconds): 0.05
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:28.91
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1892
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 153
    Voluntary context switches: 1
    Involuntary context switches: 37
    Swaps: 0
    File system inputs: 0
    File system outputs: 93064
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /usr/bin/time -v ./fq2fa_fast_zlib aa.fq.gz aa_fast_zlib.fa.gz 
    Command being timed: "./fq2fa_fast_zlib aa.fq.gz aa_fast_zlib.fa.gz"
    User time (seconds): 19.87
    System time (seconds): 0.05
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.92
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1948
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 152
    Voluntary context switches: 1
    Involuntary context switches: 26
    Swaps: 0
    File system inputs: 0
    File system outputs: 93128
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
  • 结论:在 ubuntu系统中,fast_zlib 项目对 zlib代码的修改,依旧有较大的速度提升

    • 新的问题:在 ubuntu系统中,直接使用 gcc -o fq2fa_zlib_u fq2fa.c -lz -Lzlib 编译链接,速度比 fast_zlib 修改版的尽然还要稍微快一点,原因是什么?
    • 使用 -lz -Lzlib 时候,使用的是系统的 zlib, 该版本比 zlib-1.2.12 有较大的速度提升 ?
Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ /usr/bin/time -v ./fq2fa_zlib-ubuntu aa.fq.gz aa_zlib-ubuntu.fa.gz
    Command being timed: "./fq2fa_zlib-ubuntu aa.fq.gz aa_zlib-ubuntu.fa.gz"
    User time (seconds): 18.59
    System time (seconds): 0.07
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.68
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2020
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 158
    Voluntary context switches: 2
    Involuntary context switches: 24
    Swaps: 0
    File system inputs: 0
    File system outputs: 93064
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

11.2 查看系统(ubuntu)中zlib的版本

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
$ cat /usr/lib/x86_64-linux-gnu/pkgconfig/zlib.pc
prefix=/usr
exec_prefix=${prefix}
libdir=${prefix}/lib/x86_64-linux-gnu
sharedlibdir=${libdir}
includedir=${prefix}/include

Name: zlib
Description: zlib compression library
Version: 1.2.11

Requires:
Libs: -L${libdir} -L${sharedlibdir} -lz
Cflags: -I${includedir}

11.3 那么,zlib-1.2.11 会比 zlib-1.2.12 更快吗?

测试如下:

Text Only
1
axel -n 8 https://github.com/madler/zlib/archive/refs/tags/v1.2.11.tar.gz
Text Only
1
2
$ md5sum zlib-1.2.11.tar.gz 
0095d2d2d1f3442ce1318336637b695f  zlib-1.2.11.tar.gz

编译安装

Text Only
1
2
3
mkdir build
./configure --prefix=/home/lixy/myproject/fast_zlib_test/zlib-1.2.11/build  --shared --static
make && make install

编译

Bash
1
2
3
4
5
6
7
8
9
gcc -o fq2fa_zlib-1.2.12 fq2fa.c /home/lixy/myproject/fast_zlib_test/zlib-1.2.12/build/lib/libz.a -I/home/lixy/myproject/fast_zlib_test/zlib-1.2.12/build/include

gcc -o fq2fa_zlib-1.2.11 fq2fa.c /home/lixy/myproject/fast_zlib_test/zlib-1.2.11/build/lib/libz.a -I/home/lixy/myproject/fast_zlib_test/zlib-1.2.11/build/include

gcc -o fq2fa_zlib-ubuntu fq2fa.c -lz -Lzlib

(
    gcc -o fq2fa_zlib-ubuntu fq2fa.c /usr/lib/x86_64-linux-gnu/libz.a -I/usr/include/
)

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
$ /usr/bin/time -v ./fq2fa_zlib-1.2.11 aa.fq.gz aa_zlib-1.2.11.fa.gz  
    Command being timed: "./fq2fa_zlib-1.2.11 aa.fq.gz aa_zlib-1.2.11.fa.gz"
    User time (seconds): 29.69
    System time (seconds): 0.03
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.73
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1948
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 153
    Voluntary context switches: 1
    Involuntary context switches: 38
    Swaps: 0
    File system inputs: 0
    File system outputs: 93064
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0



$ /usr/bin/time -v ./fq2fa_zlib-1.2.12 aa.fq.gz aa_zlib-1.2.12.fa.gz   
    Command being timed: "./fq2fa_zlib-1.2.12 aa.fq.gz aa_zlib-1.2.12.fa.gz"
    User time (seconds): 29.02
    System time (seconds): 0.07
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.10
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1948
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 153
    Voluntary context switches: 2
    Involuntary context switches: 39
    Swaps: 0
    File system inputs: 0
    File system outputs: 93064
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0


$ /usr/bin/time -v ./fq2fa_zlib-ubuntu aa.fq.gz aa_zlib-ubuntu.fa.gz 
    Command being timed: "./fq2fa_zlib-ubuntu aa.fq.gz aa_zlib-ubuntu.fa.gz"
    User time (seconds): 18.58
    System time (seconds): 0.03
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.61
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2008
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 157
    Voluntary context switches: 1
    Involuntary context switches: 22
    Swaps: 0
    File system inputs: 0
    File system outputs: 93064
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
  • 发现,我们编译的 zlib-1.2.11 速度和 zlib-1.2.12 一样,但是却比使用系统默认zlib的版本慢很多
  • 难道是 zlib 在编译的过程中,可以加入一些优化参数?
  • 后续我在 一个docker image 中测试了几种版本的区别,发现,现在 ubuntu 中安装 zlib 相关的包,再 gcc -o fq2fa_zlib-ubuntu fq2fa.c -lz -Lzlib 出的程序,确实比 使用 zlib-1.2.11 速度和 zlib-1.2.12 快,原因未知。

12 为什么系统自带的zlib(deb 1.2.11)比我们手动编译的要快

12.1 查看 apt 安装的软件是什么版本,下载到本地,看一下configure.log

Bash
1
2
3
4
5
6
$ apt list --installed | rg zlib

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

zlib1g-dev/focal-updates,focal-security,now 1:1.2.11.dfsg-2ubuntu1.3 amd64 [installed]
zlib1g/focal-updates,focal-security,now 1:1.2.11.dfsg-2ubuntu1.3 amd64 [installed,automatic]

网上搜索ubuntu 中得这些包,找到了 build.log 文件 - https://www.ubuntuupdates.org/pm/zlib1g-dev - https://www.ubuntuupdates.org/pm/zlib1g

  • https://launchpadlibrarian.net/593215494/buildlog_ubuntu-focal-amd64.zlib_1%3A1.2.11.dfsg-2ubuntu1.3_BUILDING.txt.gz

build.log 文件 编译命令就是:

Text Only
1
AR=ar CC="x86_64-linux-gnu-gcc" CFLAGS="`dpkg-buildflags --get CFLAGS` `dpkg-buildflags --get CPPFLAGS` -Wall -D_REENTRANT -O3 -DUNALIGNED_OK" LDFLAGS="`dpkg-buildflags --get LDFLAGS`" uname=GNU ./configure --shared --prefix=/usr --libdir=\${prefix}/lib/x86_64-linux-gnu

创建 build.sh

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
#!/bin/bash
#@File    :   build.sh
#@Time    :   2022/08/18 10:41:29
#@Author  :   biolxy
#@Version :   1.0
#@Contact :   biolxy@aliyun.com
#@Desc    :   None

SCRIPT_FOLDER=$(cd "$(dirname "$0")";pwd)

make distclean
test -d _build && rm -rf _build
mkdir _build


# --static
# --shared

AR=ar CC="x86_64-linux-gnu-gcc" CFLAGS="`dpkg-buildflags --get CFLAGS` `dpkg-buildflags --get CPPFLAGS` -Wall -D_REENTRANT -O3 -DUNALIGNED_OK" LDFLAGS="`dpkg-buildflags --get LDFLAGS`" uname=GNU ./configure --shared --prefix=${SCRIPT_FOLDER}/_build

make && make install

执行 bash ./build 重新编译链接程序,发现 手动编译的程序的速度也来到了 22s, 可以确定确实是不同的编译参数导致zlib库文件的执行效率不同

12.2 fast_zlib 的优化 是否能与 ubuntu zlib的编译参数一起使用

  • 可以,但是没有效,编译出的 fq2fa_fast_zlib-1.2.11 速度还是在 24s, 和未使用 ubuntu zlib的编译参数 前的速度一致

13 测试 zlib-1.3.1 和 fast_zlib-1.2.13 版本各自的速度

  • https://github.com/biolxy/zlib/tree/fast_zlib-v1.2.13

最近 zlib 升级到了 1.3.1 , fast_zlib 也升级到了 zlib-1.2.13

13.1 编译zlib 时不使用优化参数

Bash
1
./configure --prefix=/fast_zlib_test/zlib-1.3.1/_build --shared --static
Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
$ /usr/bin/time -v ./fq2fa_fast_zlib-1.2.13 in.fq.gz out_fq2fa_zlib-1.2.13.fa.gz 
    Command being timed: "./fq2fa_fast_zlib-1.2.13 in.fq.gz out_fq2fa_zlib-1.2.13.fa.gz"
    User time (seconds): 24.58
    System time (seconds): 0.10
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.74
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1860
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 357
    Voluntary context switches: 144
    Involuntary context switches: 166
    Swaps: 0
    File system inputs: 0
    File system outputs: 92624
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

$ /usr/bin/time -v ./fq2fa_zlib-1.3.1 in.fq.gz out_fq2fa_zlib-1.3.1.fa.gz
    Command being timed: "./fq2fa_zlib-1.3.1 in.fq.gz out_fq2fa_zlib-1.3.1.fa.gz"
    User time (seconds): 37.16
    System time (seconds): 0.13
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.48
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1828
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 257
    Voluntary context switches: 154
    Involuntary context switches: 300
    Swaps: 0
    File system inputs: 0
    File system outputs: 92608
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

13.2 编译时添加 -03 优化后

Bash
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
$ /usr/bin/time -v ./fq2fa_fast_zlib-1.2.13 in.fq.gz out_fq2fa_zlib-1.2.13.fa.gz
    Command being timed: "./fq2fa_fast_zlib-1.2.13 in.fq.gz out_fq2fa_zlib-1.2.13.fa.gz"
    User time (seconds): 23.62
    System time (seconds): 0.12
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.84
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1852
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 152
    Voluntary context switches: 188
    Involuntary context switches: 163
    Swaps: 0
    File system inputs: 0
    File system outputs: 92624
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0


$ /usr/bin/time -v ./fq2fa_zlib-1.3.1 in.fq.gz out_fq2fa_zlib-1.3.1.fa.gz
    Command being timed: "./fq2fa_zlib-1.3.1 in.fq.gz out_fq2fa_zlib-1.3.1.fa.gz"
    User time (seconds): 21.15
    System time (seconds): 0.10
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.42
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1856
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 357
    Voluntary context switches: 58
    Involuntary context switches: 191
    Swaps: 0
    File system inputs: 0
    File system outputs: 92608
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
  • 结论1: 编译zlib时不使用 -03时, fast_zlib-1.2.13 和 zlib-1.3.1 耗时比是 24.74 / 37.48 = 0.6601 , 可节约 ⅓ 的时间
  • 结论2: 编译zlib时使用 -03时, fast_zlib-1.2.13 和 zlib-1.3.1 耗时比是 23.84 / 21.42, 不能节约时间
  • 建议: 以后写c/cpp项目多使用 -O3