Replies: 7 comments 4 replies
hi @skovaka, can I check the commands that you used for nanopolish and m6anet dataprep? Also, did you use the provided BAM file / basecalling results for the hct116 sample? 16,022 lines is far too few and could explain the poor performance of the model.
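For anyone comparing their own dataprep output against the expected site count, a quick sanity check is to count how many sites clear the 20-read threshold. This is a minimal sketch, not m6anet code: it assumes the index is a CSV with a header containing an `n_reads` column, which may not match the exact `data.info` layout of your m6anet version.

```python
import csv
from io import StringIO

def count_high_coverage_sites(fh, min_reads=20, col="n_reads"):
    """Count rows in a dataprep-style CSV whose read count meets min_reads.

    Assumes a header row containing `col`; the real data.info column
    names may differ between m6anet versions, so adjust `col` as needed.
    """
    reader = csv.DictReader(fh)
    return sum(1 for row in reader if int(row[col]) >= min_reads)

# Tiny illustrative table (not real m6anet output).
demo = StringIO(
    "transcript_id,transcript_position,n_reads\n"
    "ENST0001,100,35\n"
    "ENST0001,250,12\n"
    "ENST0002,88,20\n"
)
print(count_high_coverage_sites(demo))  # 2 of 3 sites have >= 20 reads
```

Running this on both the provided labels and your own output makes the coverage mismatch discussed below easy to quantify.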
Hi @chrishendra93, thanks for the reply. Here are the commands I used, with the reads and alignments downloaded directly from the SG-NEx AWS bucket:

Also, here is the output of

I also tried basecalling and aligning myself and got similar results. Thanks again for looking into this!

Sam
I think I've determined that the "hct116_data.readcount" file from the Code Ocean link actually corresponds to replicate 3 run 1. I computed coverage pileups from the BAM files for all HCT116 direct RNA runs, and only replicate 3 run 1 had at least 20x coverage for every site present in the training labels. I saw that this run has preprocessed m6anet data available on AWS, and the counts in its "data.readcount" match the counts in the "hct116_data.readcount" labels. I also downloaded the FAST5 files labeled "SGNex_Hct116_directRNA_replicate3_run1" and noticed a subdirectory named "GIS_Hct116_directRNA_Rep2-Run1", which suggests mislabeling to me. Can you confirm whether replicate 3 run 1 was used to train m6Anet? Like you said, it doesn't seem like replicate 2 run 1 has enough coverage for training. Also, is that still the default model after the recent 2.0 release? Thanks again for your help!
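The replicate-identification logic described above (find the run whose pileup covers every labeled site at 20x or more) can be sketched in a few lines. The data structures here are illustrative placeholders, not m6anet's: a per-run coverage dict keyed by (transcript_id, position), and an iterable of labeled sites.

```python
def run_covers_labels(coverage, label_sites, min_depth=20):
    """Return True if every labeled site has at least min_depth coverage.

    `coverage` maps (transcript_id, position) -> read depth for one run;
    `label_sites` is an iterable of (transcript_id, position) keys taken
    from the training labels. Both structures are hypothetical and would
    in practice be built from a BAM pileup and the readcount file.
    """
    return all(coverage.get(site, 0) >= min_depth for site in label_sites)

labels = [("ENST0001", 100), ("ENST0002", 88)]
rep2 = {("ENST0001", 100): 15, ("ENST0002", 88): 40}  # one site too shallow
rep3 = {("ENST0001", 100): 90, ("ENST0002", 88): 41}  # all sites covered
print(run_covers_labels(rep2, labels))  # False
print(run_covers_labels(rep3, labels))  # True
```

Applied per run, a single True result points at the replicate the labels were derived from.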
Thanks for looking into it! I've successfully trained using the preprocessed m6anet data from SG-NEx. Based on my limited testing, the accuracy looks slightly higher than m6anet version 1 but slightly lower than the current version 2. The difference could just be random noise, but I'm re-basecalling the training data and re-running everything in case that makes a difference. Just to make sure, does this training command look correct to you?

And one final question: which output file should be used for the

Thanks again for all your help,
hi @skovaka and @chrishendra93, I ran into a very similar issue to @skovaka's. I'm using all the data from the SG-NEx AWS bucket, including FAST5, BAM, FASTQ, and the genome GTF and FASTA. After preprocessing, I ended up with only 16,114 transcript-level sites with more than 20 reads mapping to them on SGNex_Hct116_directRNA_replicate2_run1. Is there any solution so far? Could you provide a detailed script for preprocessing the FAST5 data? Looking forward to your reply!
Hi @zhfanrui, I can't say definitively, but I'm convinced that "SGNex_Hct116_directRNA_replicate3_run1" was used for training. The coverage from the BAM file is consistent with the labels from Code Ocean, and the FAST5 files have a subdirectory labeled "GIS_Hct116_directRNA_Rep2-Run1", which @chrishendra93 said was used for training. It would be nice to get definitive confirmation (from @jonathangoeke or @yuukiiwa?), but I got good results using it for training, so I'm fairly confident. Thanks,
Hi @skovaka and @jonathangoeke, thanks for your replies! I downloaded the rep3 data and finally obtained 419,667 DRACH sites at the transcript level, mapping to 171,520 sites at the genome level. This number is now larger than the readcount data provided on Code Ocean (121,838 sites). How does your data look, Sam? Does anyone have any ideas about this? By the way, I also noticed that m6Anet by default filters out sites that have more than 1,000 reads on one transcript. I am curious why we need to do that. Thanks
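The read-count windowing mentioned above (drop sites below a minimum and above a maximum depth) can be expressed as a simple filter. The 20/1,000 defaults here reflect my reading of m6anet dataprep's readcount options; verify against `m6anet dataprep --help` for your installed version, and note the rationale comment is my guess, not a statement from the authors.

```python
def filter_sites(readcounts, lo=20, hi=1000):
    """Keep sites whose read count falls within [lo, hi].

    `readcounts` maps a site identifier -> number of reads. The lo=20 /
    hi=1000 defaults mirror what I understand m6anet dataprep to use;
    the upper cap presumably bounds per-site memory and keeps ultra-deep
    sites from dominating training, but that is an assumption.
    """
    return {site: n for site, n in readcounts.items() if lo <= n <= hi}

counts = {"siteA": 8, "siteB": 350, "siteC": 4_200}
print(sorted(filter_sites(counts)))  # ['siteB']
```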
Hello,
Thank you for providing this great tool! I am attempting to train m6Anet using the same HCT116 data used in the paper (replicate 2 run 1), but I am having trouble with the training labels. I'm using the "hct116_data.readcount" file provided in the Code Ocean repository, which contains 121,839 lines, but when I run nanopolish and dataprep on the HCT116 data the resulting "data.info" file contains only 16,022 lines. Additionally, the average read count in the labels is 90.5, while the average read count in my data is 67.6. I've tried both using your provided basecalls and re-basecalling myself, and got similar results either way.
I tried going ahead with training anyway and applied the labels to my "data.info" file, but the resulting model performed very poorly on the HCT116 data (AUPR = 0.182 vs. 0.339 for the default model). Any idea what's wrong? Should the "hct116_data.readcount" file resemble my "data.info" file more closely, or is it expected for them to differ so much?
Thanks,
Sam
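For reproducing AUPR comparisons like the 0.182 vs. 0.339 above without extra dependencies, average precision can be computed directly from labels and scores. This is a generic sketch (it agrees with scikit-learn's `average_precision_score` on tie-free inputs), not the evaluation code from the m6Anet paper.

```python
def average_precision(labels, scores):
    """Average precision, a standard single-number AUPR summary.

    labels: 1 for truly modified sites, 0 otherwise (must contain at
    least one positive); scores: model probabilities, higher = more
    confident. Sums precision at the rank of each positive.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    positives = sum(labels)
    tp = ap = 0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / positives

print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))  # ~0.833
```

Comparing this number for two models on the same labeled sites is exactly the AUPR contrast quoted in the question.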