I downloaded some data from GTEx following these instructions: https://anvilproject.org/learn/reference/gtex-v8-free-egress-instructions
Using the command:
gen3-client download-multiple --profile=${profile_name} --manifest=${manifest_name}.json --no-prompt --numparallel=20 --protocol=s3 --download-path=${download_path} --skip-completed
However, after downloading all of the WES data and some of the RNA-seq data, I found that a considerable number of files are corrupted. Specifically, when I run samtools view on these BAM files, samtools gives the following error:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_uncompress] Inflate operation failed: progress temporarily not possible, or in() / out() returned an error
[E::bgzf_read] Read block operation failed with error 1 after 1 of 267 bytes
[main_samview] truncated file.
16838939
So I checked the last 28 bytes of the file, which should be the BGZF EOF marker of a BAM file:
tail -c 28 ${bam_file_name}.bam | xxd -p
This shows that the last 28 bytes are all zeros instead of the EOF marker. I then checked the end of every BAM file affected by the issue and found that all of them end in large runs of zero bytes.
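For anyone who wants to repeat this check across a whole download directory, here is a minimal sketch. It assumes the standard 28-byte empty BGZF block that the SAM/BAM specification defines as the EOF marker. The helper name has_bgzf_eof is made up for illustration, and od is used instead of xxd only because it is available on more systems. It only tests for the marker, so a file that passes can still be damaged elsewhere.

```shell
#!/bin/sh
# Expected last 28 bytes of a complete BAM file: the empty BGZF block
# that the SAM/BAM specification defines as the EOF marker.
BGZF_EOF="1f8b08040000000000ff0600424302001b0003000000000000000000"

# has_bgzf_eof FILE (hypothetical helper): succeeds if FILE ends with the marker.
has_bgzf_eof() {
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$tail_hex" = "$BGZF_EOF" ]
}

# Report every BAM file in the download directory that lacks the marker.
# download_path is assumed to be set as in the download command above;
# the current directory is used if it is not.
for f in "${download_path:-.}"/*.bam; do
    [ -e "$f" ] || continue
    has_bgzf_eof "$f" || echo "missing EOF marker: $f"
done
```

Files flagged this way would be the ones to delete and download again, since --skip-completed may otherwise treat them as finished.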
All of the downloads mentioned above were performed on an HPC cluster running centos-release-7-8.2003.0.el7.centos.x86_64. I then downloaded one of the BAM files on my local Mac, and that copy is also affected by this issue.
Can someone help me figure out what is going on here? Thanks a lot!