Replies: 7 comments 4 replies
hi @skovaka, can I check the commands that you used for nanopolish and m6anet dataprep? Also, did you use the provided BAM file / basecalling results for the hct116 sample? 16,022 lines is far too few and could explain the poor performance of the model.
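For anyone comparing their own dataprep output against the expected site count, a quick sanity check is to count how many sites clear the 20-read threshold. This is a minimal sketch, not m6anet code: it assumes the index is a CSV with a header containing an `n_reads` column, which may not match the exact `data.info` layout of your m6anet version.

```python
import csv
from io import StringIO

def count_high_coverage_sites(fh, min_reads=20, col="n_reads"):
    """Count rows in a dataprep-style CSV whose read count meets min_reads.

    Assumes a header row containing `col`; the real data.info column
    names may differ between m6anet versions, so adjust `col` as needed.
    """
    reader = csv.DictReader(fh)
    return sum(1 for row in reader if int(row[col]) >= min_reads)

# Tiny illustrative table (not real m6anet output).
demo = StringIO(
    "transcript_id,transcript_position,n_reads\n"
    "ENST0001,100,35\n"
    "ENST0001,250,12\n"
    "ENST0002,88,20\n"
)
print(count_high_coverage_sites(demo))  # 2 of 3 sites have >= 20 reads
```

Running this on both the provided labels and your own output makes the coverage mismatch discussed below easy to quantify.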
Hi @chrishendra93, thanks for the reply. Here are the commands I used, with the reads and alignments downloaded directly from the SG-NEx AWS bucket:

Also, here is the output of

I also tried basecalling and aligning myself and got similar results. Thanks again for looking into this!

Sam
I think I've determined that the "hct116_data.readcount" file from the Code Ocean link actually corresponds to replicate 3 run 1. I computed coverage pileups from the BAM files for all HCT116 direct RNA runs, and only replicate 3 run 1 had at least 20x coverage for every site present in the training labels. I saw that this run has preprocessed m6anet data available on AWS, and the counts in its "data.readcount" match the counts in the "hct116_data.readcount" labels. I also downloaded the FAST5 files labeled "SGNex_Hct116_directRNA_replicate3_run1" and noticed a subdirectory named "GIS_Hct116_directRNA_Rep2-Run1", which suggests mislabeling to me. Can you confirm whether replicate 3 run 1 was used to train m6Anet? Like you said, it doesn't seem like replicate 2 run 1 has enough coverage for training. Also, is that still the default model after the recent 2.0 release? Thanks again for your help!
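The replicate-identification logic described above (find the run whose pileup covers every labeled site at 20x or more) can be sketched in a few lines. The data structures here are illustrative placeholders, not m6anet's: a per-run coverage dict keyed by (transcript_id, position), and an iterable of labeled sites.

```python
def run_covers_labels(coverage, label_sites, min_depth=20):
    """Return True if every labeled site has at least min_depth coverage.

    `coverage` maps (transcript_id, position) -> read depth for one run;
    `label_sites` is an iterable of (transcript_id, position) keys taken
    from the training labels. Both structures are hypothetical and would
    in practice be built from a BAM pileup and the readcount file.
    """
    return all(coverage.get(site, 0) >= min_depth for site in label_sites)

labels = [("ENST0001", 100), ("ENST0002", 88)]
rep2 = {("ENST0001", 100): 15, ("ENST0002", 88): 40}  # one site too shallow
rep3 = {("ENST0001", 100): 90, ("ENST0002", 88): 41}  # all sites covered
print(run_covers_labels(rep2, labels))  # False
print(run_covers_labels(rep3, labels))  # True
```

Applied per run, a single True result points at the replicate the labels were derived from.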
Thanks for looking into it! I've successfully trained using the preprocessed m6anet data from SG-NEx. Based on my limited testing, the accuracy looks slightly higher than m6anet version 1 but slightly lower than the current version 2. The difference could just be random noise, but I'm re-basecalling the training data and re-running everything in case that makes a difference. Just to make sure, does this training command look correct to you?

And one final question: which output file should be used for the

Thanks again for all your help,
hi @skovaka and @chrishendra93, I ran into a very similar issue to @skovaka's. I'm using all the data from the SG-NEx AWS bucket, including FAST5, BAM, FASTQ, and the genome GTF and FASTA. After preprocessing, I ended up with only 16,114 transcript-level sites with more than 20 reads mapping to them on SGNex_Hct116_directRNA_replicate2_run1. Is there any solution so far? Could you provide a detailed script for preprocessing the FAST5 data? Looking forward to your reply!
Hi @zhfanrui, I can't say definitively, but I'm convinced that "SGNex_Hct116_directRNA_replicate3_run1" was used for training. The coverage from the BAM file is consistent with the labels from Code Ocean, and the FAST5 files have a subdirectory labeled "GIS_Hct116_directRNA_Rep2-Run1", which @chrishendra93 said was used for training. It would be nice to get definitive confirmation (from @jonathangoeke or @yuukiiwa?), but I got good results using it for training, so I'm fairly confident. Thanks,
Hi @skovaka and @jonathangoeke, thanks for your replies! I downloaded the rep3 data and finally obtained 419,667 DRACH sites at the transcript level, mapping to 171,520 sites at the genome level. This number is now larger than the readcount data provided on Code Ocean (121,838 sites). How does your data look, Sam? Does anyone have any ideas about this? By the way, I also noticed that m6Anet by default filters out sites that have more than 1,000 reads on one transcript. I am curious why we need to do that. Thanks
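The read-count windowing mentioned above (drop sites below a minimum and above a maximum depth) can be expressed as a simple filter. The 20/1,000 defaults here reflect my reading of m6anet dataprep's readcount options; verify against `m6anet dataprep --help` for your installed version, and note the rationale comment is my guess, not a statement from the authors.

```python
def filter_sites(readcounts, lo=20, hi=1000):
    """Keep sites whose read count falls within [lo, hi].

    `readcounts` maps a site identifier -> number of reads. The lo=20 /
    hi=1000 defaults mirror what I understand m6anet dataprep to use;
    the upper cap presumably bounds per-site memory and keeps ultra-deep
    sites from dominating training, but that is an assumption.
    """
    return {site: n for site, n in readcounts.items() if lo <= n <= hi}

counts = {"siteA": 8, "siteB": 350, "siteC": 4_200}
print(sorted(filter_sites(counts)))  # ['siteB']
```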
Hello,
Thank you for providing this great tool! I am attempting to train m6Anet using the same HCT116 data used in the paper (replicate 2 run 1), but I am having trouble with the training labels. I'm using the "hct116_data.readcount" file provided in the Code Ocean repository, which contains 121,839 lines, but when I run nanopolish and dataprep on the HCT116 data the resulting "data.info" file contains only 16,022 lines. Additionally, the average read count in the labels is 90.5, while the average read count in my data is 67.6. I've tried both using your provided basecalls and re-basecalling myself, and got similar results either way.
I tried going ahead with training anyway and applied the labels to my "data.info" file, but the resulting model performed very poorly on the HCT116 data (AUPR = 0.182 vs. 0.339 for the default model). Any idea what's wrong? Should the "hct116_data.readcount" file resemble my "data.info" file more closely, or is it expected for them to differ so much?
Thanks,
Sam
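For reproducing AUPR comparisons like the 0.182 vs. 0.339 above without extra dependencies, average precision can be computed directly from labels and scores. This is a generic sketch (it agrees with scikit-learn's `average_precision_score` on tie-free inputs), not the evaluation code from the m6Anet paper.

```python
def average_precision(labels, scores):
    """Average precision, a standard single-number AUPR summary.

    labels: 1 for truly modified sites, 0 otherwise (must contain at
    least one positive); scores: model probabilities, higher = more
    confident. Sums precision at the rank of each positive.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    positives = sum(labels)
    tp = ap = 0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / positives

print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))  # ~0.833
```

Comparing this number for two models on the same labeled sites is exactly the AUPR contrast quoted in the question.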