The Assignment for Programming Training

Fu Yu

Usage

• “python mapping_report INPUT_FILE”. The INPUT_FILE is the SAM file that you want to process.

• Please keep in mind that you should first change the current directory to the folder that contains the Python script because the results will be put in the current directory.

Results

The script finishes processing a SAM file of approximately 160 megabytes in about 18 s. It writes a file named “gross result” to the current directory, which contains all the results, and a “Distribution_of_scores.pdf”, in which you can find the quality score distribution.

What’s more, my script processes a SAM file of more than 2 GB in only 356.87 s.

Ultrafast!!!

Multithreading

• I also tried multithreading to speed up the program, but Python does not seem to be good at this: the multithreaded version takes about 40 s to finish the task. So I gave up on multithreading.

Data source

• SRR037828.fastq was selected randomly from those .fastq files. It was mapped back to the

Q1: Mapping report

• Generate a report on the number and percentage of tags that have been mapped back to the genome, and the total number of tags.

Step1 - About SAM files

• Use the FLAG field to decide whether a tag is mapped back or not.
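
A minimal sketch of such a check, not the original script: in the SAM format the FLAG is the second tab-separated field, and its bit 0x4 marks a read as unmapped. The function name `is_mapped` and the two sample lines are hypothetical.

```python
UNMAPPED_BIT = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_mapped(sam_line):
    """Return True if the FLAG of an alignment line lacks the unmapped bit."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the 2nd mandatory field
    return (flag & UNMAPPED_BIT) == 0


# Hypothetical sample records, for illustration only.
mapped_line = "tag1\t16\tchr1\t112233\t37\t4M\t*\t0\t0\tACGT\tIIII"
unmapped_line = "tag2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
print(is_mapped(mapped_line), is_mapped(unmapped_line))
```

Testing the bit rather than comparing the whole FLAG to 4 keeps the check valid when other bits are also set.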

Step1 - Data

• Use unmapped to record the number of tags that are not mapped back, and the dictionary chr_mapped_num to store how many tags have been mapped back to the genome. The dictionary might look redundant, but it helps in later steps.
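
The two counters could be filled roughly as follows. This is a sketch under one assumption that the slide does not state: that chr_mapped_num is keyed by chromosome name (the RNAME field). The function names `count_tags` and `mapping_report` are hypothetical.

```python
from collections import defaultdict


def count_tags(sam_lines):
    """Return (unmapped, chr_mapped_num) for an iterable of SAM lines."""
    unmapped = 0
    chr_mapped_num = defaultdict(int)  # assumed layout: chromosome -> mapped tags
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:  # FLAG bit 0x4: unmapped
            unmapped += 1
        else:
            chr_mapped_num[fields[2]] += 1  # RNAME is the 3rd field
    return unmapped, dict(chr_mapped_num)


def mapping_report(unmapped, chr_mapped_num):
    """Return (total tags, mapped tags, percentage mapped)."""
    mapped = sum(chr_mapped_num.values())
    total = mapped + unmapped
    percent = 100.0 * mapped / total if total else 0.0
    return total, mapped, percent


# Hypothetical sample input.
demo_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "t1\t0\tchr1\t10\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "t2\t16\tchr2\t20\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "t3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "t4\t0\tchr1\t30\t37\t4M\t*\t0\t0\tACGT\tIIII",
]
unmapped, chr_mapped_num = count_tags(demo_lines)
print(mapping_report(unmapped, chr_mapped_num))
```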

Step1 - RegEx

• It uses regular expressions to get the name and the length of each chromosome.
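
The exact patterns are not shown on the slide. As a sketch, assuming the names and lengths come from the @SQ header lines (where the SAM format stores them in the SN and LN tags), two small patterns are enough. `SN_RE`, `LN_RE` and `parse_sq_line` are hypothetical names.

```python
import re

SN_RE = re.compile(r"\tSN:([^\t\n]+)")  # reference sequence name
LN_RE = re.compile(r"\tLN:(\d+)")       # reference sequence length


def parse_sq_line(line):
    """Return (name, length) for an @SQ header line, or None otherwise."""
    if not line.startswith("@SQ"):
        return None
    name = SN_RE.search(line)
    length = LN_RE.search(line)
    if name is None or length is None:
        return None
    return name.group(1), int(length.group(1))


print(parse_sq_line("@SQ\tSN:chr1\tLN:248956422\n"))
```

Searching for each tag separately means the sketch does not depend on the order of the tags within the line.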

Step1 - Getting the header

This step reads everything that the header contains. It also handles possible exceptions in case the SAM file is corrupted.
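
One way this could look, as a sketch only: header lines in a SAM file start with “@”, so reading stops at the first line that does not. Which exceptions the original script catches is not stated; the ones below are an assumption, and both function names are hypothetical.

```python
import io


def read_header(handle):
    """Collect the leading '@' lines of an open SAM file."""
    header = []
    for line in handle:
        if not line.startswith("@"):
            break  # first alignment line reached; the header is over
        header.append(line.rstrip("\n"))
    return header


def read_header_from_path(path):
    """Read the header, returning an empty list if the file is unreadable."""
    try:
        with open(path, "r") as handle:
            return read_header(handle)
    except (OSError, UnicodeDecodeError) as err:
        # Assumed failure modes: missing or unreadable file, undecodable bytes.
        print("Could not read SAM header from %s: %s" % (path, err))
        return []


demo = io.StringIO("@HD\tVN:1.0\n@SQ\tSN:chr1\tLN:10\ntag\t0\tchr1\n")
print(read_header(demo))
```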

Step1 - Read in all the tags

Q2: Quality score report

• Use R to draw the distribution of FASTQ quality scores across all mapped tags.

Step2 - Loop

Write the score of each tag to “f_out_quality_score”, so that Rscript can then process the scores and draw the distribution.
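
A sketch of such a loop, under two assumptions not stated on the slide: that the qualities are Phred+33 encoded (as the SAM format specifies for the QUAL field) and that the “score of each tag” is the mean over its bases. The function names are hypothetical.

```python
import io

PHRED_OFFSET = 33  # SAM stores QUAL as ASCII characters of Phred score + 33


def phred_scores(qual):
    """Decode a QUAL string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def write_quality_scores(sam_lines, f_out_quality_score):
    """Write one mean score per mapped tag; return how many were written."""
    written = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4 or fields[10] == "*":
            continue  # skip unmapped tags and tags without stored qualities
        scores = phred_scores(fields[10])  # QUAL is the 11th field
        f_out_quality_score.write("%.2f\n" % (sum(scores) / len(scores)))
        written += 1
    return written


# Hypothetical sample input; a StringIO stands in for the output file.
demo_lines = [
    "t1\t0\tchr1\t10\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "t2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!",
]
out = io.StringIO()
write_quality_scores(demo_lines, out)
print(out.getvalue())
```

Decoding every base is what makes this step scale with the tag length as well as with the number of tags.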

Step2 - R

Here, the Python script creates an R script and calls it from the terminal, so that we do not have to run Rscript ourselves.
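
The generated R code is not shown on the slide, so the following is only a sketch of the general approach: write a small R script to disk, then launch it with `Rscript` through `subprocess`. The R template, the script name and both function names are hypothetical; only the file names “f_out_quality_score” and “Distribution_of_scores.pdf” come from the slides.

```python
import os
import shutil
import subprocess
import tempfile

# Hypothetical R code: read one score per line and plot a histogram to a PDF.
R_TEMPLATE = """scores <- scan("{scores}")
pdf("{pdf}")
hist(scores, main = "Distribution of scores", xlab = "Quality score")
dev.off()
"""


def write_r_script(script_path, scores_path, pdf_path):
    """Write the R script and return the command that would run it."""
    with open(script_path, "w") as handle:
        handle.write(R_TEMPLATE.format(scores=scores_path, pdf=pdf_path))
    return ["Rscript", script_path]


def run_r_script(command):
    """Run the command if Rscript is installed; return whether it was run."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


script = os.path.join(tempfile.mkdtemp(), "draw_scores.r")
command = write_r_script(script, "f_out_quality_score", "Distribution_of_scores.pdf")
print(command)
```

Calling `run_r_script(command)` after the scores file has been written would then produce the PDF without leaving Python.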

Step3&4

Steps 3 and 4 share a single loop because both iterate over the tags in the same way. This improves the efficiency of the script.
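
The idea of sharing one loop can be sketched generically: each record is split once and handed to every per-step handler, instead of re-reading the file for each step. The handlers below are hypothetical stand-ins, not the actual logic of Steps 3 and 4.

```python
def single_pass(sam_lines, handlers):
    """Call every handler on every alignment record in one traversal."""
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        for handler in handlers:
            handler(fields)


flag_counts = {}
chrom_counts = {}


def tally_flag(fields):
    flag_counts[fields[1]] = flag_counts.get(fields[1], 0) + 1


def tally_chrom(fields):
    chrom_counts[fields[2]] = chrom_counts.get(fields[2], 0) + 1


# Hypothetical sample input.
demo_lines = [
    "@HD\tVN:1.0",
    "t1\t0\tchr1\t10",
    "t2\t16\tchr1\t20",
    "t3\t0\tchr2\t30",
]
single_pass(demo_lines, [tally_flag, tally_chrom])
print(flag_counts, chrom_counts)
```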

Q3: Unique mapped tag

• Count the number of tags that are each mapped back to only one genomic location.

Step3

• This step uses a dictionary: the key is chr + symbol + loc (e.g. chr1+112233) and the value is the number of repeats. If a key has a value of 2 or more, it is left out of the count. All the keys that have a value of 1 are totaled, and this is the result. The image above shows how the program handles + strands. In the try block, if the line does not have a 19th field, the program goes into the exception handler (which does nothing). If it does have one, the field is kept in the dictionary for later use.
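
The dictionary part of this step can be sketched as below. It covers only the key construction and the final tally, and leaves out the try block around the 19th field. Reading the strand symbol from FLAG bit 0x10 is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def location_key(fields):
    """Build a key such as 'chr1+112233' from chromosome, strand and position."""
    strand = "-" if int(fields[1]) & 0x10 else "+"  # FLAG bit 0x10: reverse strand
    return fields[2] + strand + fields[3]


def count_single_keys(sam_lines):
    """Count keys that occur exactly once among the mapped tags."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:  # skip unmapped tags
            continue
        counts[location_key(fields)] += 1
    # Keys seen two or more times are left out of the total.
    return sum(1 for n in counts.values() if n == 1)


# Hypothetical sample input: two tags share chr1+112233, one sits at chr2-500.
demo_lines = [
    "t1\t0\tchr1\t112233\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "t2\t0\tchr1\t112233\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "t3\t16\tchr2\t500\t37\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_single_keys(demo_lines))
```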

Q4: Unique mapping location

• Count the number of genomic locations that only have one tag mapped.

Step4 – using the XA field

Use the XA field to determine how many genomic locations a tag maps to and exactly where those locations are.
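
As a sketch of reading that field, assuming the SAM file was produced by BWA: there the XA tag lists alternative hits as `chr,pos,CIGAR,NM;` entries, with the strand given by the sign of `pos`. The sketch looks the tag up by its `XA:Z:` prefix rather than by field position, which is a choice made here and not necessarily what the original script does. The function name is hypothetical.

```python
def parse_xa(fields):
    """Return the alternative hits of a record as (chrom, strand, pos) tuples."""
    for field in fields[11:]:  # optional fields follow the 11 mandatory ones
        if field.startswith("XA:Z:"):
            hits = []
            for entry in field[5:].split(";"):
                if not entry:  # the list ends with a trailing ';'
                    continue
                chrom, pos, cigar, nm = entry.rsplit(",", 3)
                hits.append((chrom, pos[0], int(pos[1:])))
            return hits
    return []


# Hypothetical sample record with two alternative hits.
demo_fields = (
    "tag\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII\t"
    "XA:Z:chr2,+200,4M,0;chr3,-300,4M,1;"
).split("\t")
hits = parse_xa(demo_fields)
print(1 + len(hits), hits)  # primary location plus the alternatives
```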

Step4

If a line has a flag of ‘0’ or ‘16’, together with the 19th field, then it is a tag that fulfills the given condition. Counting these lines gives the result.

Time complexity

• This script uses several loops. Step One relies on a loop that repeats N times (N is the number of tags), so its complexity is O(N).

• Similarly, Step Three’s complexity is O(N).

• However, Step Two and Step Four need N*l operations (l is the number of bases in each tag), so the time complexity of the script is O(N*l).

Time

• I use the “time” module to time the whole process. It takes about 20 s for my script to process a SAM file of approximately 160 megabytes.
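
Which function of the “time” module the script calls is not stated. A minimal sketch using `time.perf_counter`, with the hypothetical helper `timed`:

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the script this would be the whole processing run.
result, elapsed = timed(sum, range(1000000))
print("finished in %.2fs" % elapsed)
```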

All in a single run

• All 4 steps are done within the Python script, so we do not have to run “Rscript xxx.r” outside the script.

Summary

• Multithreading
• Identifying the meaning of each optional field
• Using dictionaries to count the number of tags
• Using RegEx to capture the necessary information
• Loop: trying to decrease the number of nested loops as much as possible

• Thank you!