Lab 6 Problem 1: DNA. DNA Given a string with length N, determine the number of occurrences of some...


Lab 6

Problem 1: DNA


DNA
Given a string with length N, determine the number of occurrences of some given substrings (with length K) in that string.
For instance,
String: ACGTAC (N = 6)
Substring: AC (K = 2)
Answer: AC occurs 2 times in the string ACGTAC.


DNA
A substring is a consecutive part of a string.
Note that AG is not a substring of ACGTAC.


Brute-force Algorithm
For each query
◦ Iterate through the entire string
◦ For each position in the string, check the substring starting there, and increment the count if it matches the query


DNA (70%)
for (int i = 0; i + K <= N; i++) { // only start where a full window of length K still fits
    boolean found = true;
    for (int j = 0; j < K; j++) {
        if (text[i + j] != pattern[j]) { // character mismatch
            found = false;
            break;
        }
    }
    if (found) counter++;
}


DNA (70%)
We can answer one query in O(NK).
Hence with Q queries, the time complexity will be O(QNK).
Solution: For every query and every index i from 0 to N - K, we check the substring with length K starting at index i.
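
A minimal, self-contained sketch of the brute-force count, for illustration only. The class and method names (BruteForceDna, countOccurrences) are assumptions and do not come from the lab; the strings are taken from the ACGTAC example.

```java
class BruteForceDna {
    // Counts how many times pattern occurs in text (overlapping matches included).
    static int countOccurrences(String text, String pattern) {
        int n = text.length();
        int k = pattern.length();
        int counter = 0;
        // i + k <= n: only start where a full window of length k still fits.
        for (int i = 0; i + k <= n; i++) {
            boolean found = true;
            for (int j = 0; j < k; j++) {
                if (text.charAt(i + j) != pattern.charAt(j)) { // character mismatch
                    found = false;
                    break;
                }
            }
            if (found) counter++;
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("ACGTAC", "AC")); // prints 2
        System.out.println(countOccurrences("ACGTAC", "AG")); // prints 0
    }
}
```

Each call costs O(NK), so Q calls cost O(QNK), matching the analysis above.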


DNA (100%)
Java Hashtable


DNA (100%)
Key: substring
Value: number of occurrences of the substring
Iterate through the string once to populate the hash table: O(NK)
Constant time for each query


DNA (100%)
ACGTAC
Store the substrings as keys (here K = 2): AC, CG, GT, TA.


DNA (100%)
We will have:
occur[AC] = 2
occur[CG] = 1
occur[GT] = 1
occur[TA] = 1
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}


DNA (100%)
After we have built the table, we can answer a query in O(1), treating K as a small constant, by searching the hash table with the query as the key.
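
The table-based approach can be sketched as follows. This is one possible version under stated assumptions: it uses java.util.HashMap as the hash table, and the names DnaHashTable, buildTable and query are illustrative, not from the lab.

```java
import java.util.HashMap;
import java.util.Map;

class DnaHashTable {
    // Maps every length-k substring of text to its number of occurrences.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            String key = text.substring(i, i + k); // O(k) per window, O(NK) overall
            occur.merge(key, 1, Integer::sum);     // insert 1 or add 1 to the old count
        }
        return occur;
    }

    // Answers one query; substrings that never occur are absent from the table.
    static int query(Map<String, Integer> occur, String pattern) {
        return occur.getOrDefault(pattern, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> occur = buildTable("ACGTAC", 2);
        System.out.println(query(occur, "AC")); // prints 2
        System.out.println(query(occur, "AG")); // prints 0
    }
}
```

The table is built once, so Q queries cost O(NK) for the build plus one lookup per query.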


DNA-V2
Implement our own hash table!
Since K is very small, we can use a simple hash function and an array as the table.


DNA-V2
Hash function?
First, we map A to 1, C to 2, G to 3, T to 4 (we only have A, C, G, and T in a DNA sequence).


DNA-V2
ACGTAC
We only need to store the number corresponding to each substring: AC = 12, CG = 23, GT = 34, TA = 41.
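
A possible hash function matching the numbers above, as a sketch. It reads the mapped digits as a decimal number; the names DnaHash, digit and hash, and the (text, start, k) parameter list, are assumptions. It also assumes K is at most 9 so the value fits in an int.

```java
class DnaHash {
    // Maps a nucleotide to the digit used above: A=1, C=2, G=3, T=4.
    static int digit(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a DNA character: " + c);
        }
    }

    // Reads the k characters starting at index start as a decimal number,
    // e.g. "AC" becomes 12. Assumes k <= 9 so the result fits in an int.
    static int hash(String text, int start, int k) {
        int value = 0;
        for (int j = 0; j < k; j++) {
            value = value * 10 + digit(text.charAt(start + j));
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(hash("ACGTAC", 0, 2)); // prints 12
        System.out.println(hash("ACGTAC", 3, 2)); // prints 41
    }
}
```

Because no digit is 0, two different substrings of the same length never share a hash value.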


DNA-V2
We will have:
occur[12] = 2
occur[23] = 1
occur[34] = 1
occur[41] = 1
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}


DNA (100%)
After we have built the table, we can answer a query in O(K) by calculating the hash value X of the substring in that query.
Output the value in occur[X].
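
Putting the array version together might look like the sketch below. The names (DnaArrayTable, buildTable, query) are hypothetical, the table is sized 10^K so every K-digit hash value is a valid index, and every query is assumed to have length exactly K with K at most 9.

```java
class DnaArrayTable {
    // Decimal hash of the k characters starting at start, with A=1, C=2, G=3, T=4.
    static int hash(String s, int start, int k) {
        int value = 0;
        for (int j = 0; j < k; j++) {
            int d = "ACGT".indexOf(s.charAt(start + j)) + 1; // 0 means not found
            if (d == 0) throw new IllegalArgumentException("not a DNA character");
            value = value * 10 + d;
        }
        return value;
    }

    // occur[x] = number of length-k substrings of text whose hash value is x.
    static int[] buildTable(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) size *= 10; // 10^k slots
        int[] occur = new int[size];
        for (int i = 0; i < text.length() - k + 1; i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    // O(K) per query: hash the pattern, then read one array slot.
    static int query(int[] occur, String pattern) {
        return occur[hash(pattern, 0, pattern.length())];
    }

    public static void main(String[] args) {
        int[] occur = buildTable("ACGTAC", 2);
        System.out.println(query(occur, "AC")); // prints 2
        System.out.println(query(occur, "AG")); // prints 0
    }
}
```

A base-4 or base-5 encoding would need a smaller array; decimal is used here only because it reproduces the values 12, 23, 34 and 41.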


Problem 2: Find Substring


Find Substring
Given 2 strings,
Output 0: if a substring is in neither string 1 nor string 2
Output 1: if a substring is only in string 1
Output 2: if a substring is only in string 2
Output 3: if a substring is in both string 1 and string 2


Find Substring (70%)
Check the existence of a substring in both strings to determine the answer.
You might notice that this problem is very similar to the DNA problem, i.e. a substring is in a string if its number of occurrences is greater than 0.
Can be solved using the same technique as for DNA (70%).


Find Substring (100%)
It is possible to reuse the solution for DNA.
If the number of occurrences of a substring in a given string is greater than 0, it means that we can find the substring in the string.
You need 2 tables, one for the first string and another one for the second string.


Find Substring (100%)
For example, we have 2 strings, i.e. ACGTAC and ACTGCA.
Use the same technique as the one in DNA.


Find Substring (100%)
After we have built the tables, we can answer a query in O(1).
E.g. check occur1.get("AC") and occur2.get("AC")
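
The two-table lookup can be sketched like this, using the strings ACGTAC and ACTGCA from the example. The names FindSubstring, buildTable and classify are illustrative, and the sketch assumes all queries have the same length K (here 2), which the slides do not state for this problem.

```java
import java.util.HashMap;
import java.util.Map;

class FindSubstring {
    // Maps every length-k substring of text to its number of occurrences.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // Returns 0 (in neither string), 1 (only string 1), 2 (only string 2) or 3 (both).
    static int classify(Map<String, Integer> occur1, Map<String, Integer> occur2, String query) {
        boolean inFirst = occur1.getOrDefault(query, 0) > 0;
        boolean inSecond = occur2.getOrDefault(query, 0) > 0;
        return (inFirst ? 1 : 0) + (inSecond ? 2 : 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> occur1 = buildTable("ACGTAC", 2);
        Map<String, Integer> occur2 = buildTable("ACTGCA", 2);
        System.out.println(classify(occur1, occur2, "AC")); // prints 3
        System.out.println(classify(occur1, occur2, "CG")); // prints 1
        System.out.println(classify(occur1, occur2, "TG")); // prints 2
        System.out.println(classify(occur1, occur2, "AA")); // prints 0
    }
}
```

Since only existence matters here, a set per string would also work; counts are kept to stay close to the DNA solution.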


Incantation-E


Task
Find an interval (continuous section) that
◦ Contains all incantations
◦ Has minimal total length
{acer, wei, wei, acer, acer, jing, acer, wei}


Idea
{acer, wei, wei, acer, acer, jing, acer, wei}
Maintain the interval using a queue
◦ Step 1: Initially empty
  {[] acer, wei, wei, acer, acer, jing, acer, wei}
◦ Step 2: While the queue does not contain all words, add words at the back of the queue
  {[acer, wei, wei, acer, acer, jing], acer, wei}


Idea
◦ Step 3: While the front of the queue is redundant, pop it out, and update the minimum total length
  {acer, wei, [wei, acer, acer, jing], acer, wei}, min = 15
◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue, and go to Step 3
◦ Final Answer: {acer, wei, wei, acer, acer, [jing, acer, wei]}, min = 11


Time Complexity: O(N).
How to check whether the first word in the queue is redundant?
◦ Use hashing to store each word's number of occurrences in the queue. The first word is redundant if its count is greater than 1.
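
The queue-based idea can be sketched as below. This is an illustrative version, not the lab's reference solution: the names Incantation and minTotalLength are assumptions, "total length" is taken to be the sum of the word lengths (which reproduces min = 15 and min = 11 from the example), and Steps 2 and 4 are folded into a single loop that adds one word at a time.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

class Incantation {
    // Minimal total length (sum of word lengths) of a contiguous run of words
    // that contains every distinct word of the list at least once.
    static int minTotalLength(String[] words) {
        if (words.length == 0) return 0;
        int distinct = new HashSet<>(Arrays.asList(words)).size();
        Deque<String> queue = new ArrayDeque<>();
        Map<String, Integer> count = new HashMap<>(); // occurrences of each word in the queue
        int total = 0;                                // total length of the words in the queue
        int best = Integer.MAX_VALUE;
        for (String word : words) {
            // Add the next word at the back of the queue.
            queue.addLast(word);
            count.merge(word, 1, Integer::sum);
            total += word.length();
            // The front is redundant if the same word appears again in the queue.
            while (count.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                count.merge(front, -1, Integer::sum);
                total -= front.length();
            }
            // Counts never drop to 0, so count.size() is the number of
            // distinct words currently in the queue.
            if (count.size() == distinct) best = Math.min(best, total);
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"acer", "wei", "wei", "acer", "acer", "jing", "acer", "wei"};
        System.out.println(minTotalLength(words)); // prints 11
    }
}
```

Each word enters and leaves the queue at most once, which gives O(N) queue operations in total.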