Lab 6 Problem 1: DNA. DNA Given a string with length N, determine the number of occurrences of some...

Lab 6

Problem 1: DNA

DNA
Given a string of length N, determine the number of occurrences of some given substrings (each of length K) in that string.

For instance:
String: ACGTAC (N = 6)
Substring: AC (K = 2)
Answer: there are 2 occurrences of AC in the string ACGTAC.

DNA
A substring is a consecutive part of a string. Note that AG is not a substring of ACGTAC.

Brute-force Algorithm
For each query, iterate through the entire string. For each position in the string, check the substring starting there, and increment the count if it matches.

DNA (70%)
for (int i = 0; i + K <= N; i++) { // only start positions where K characters still fit
    boolean found = true;
    for (int j = 0; j < K; j++) {
        if (text[i + j] != pattern[j]) { // character mismatch
            found = false;
            break;
        }
    }
    if (found) counter++;
}

DNA (70%)
Solution: for every query, we check the substring of length K starting at each index i.
We can answer one query in O(N·K).
Hence with Q queries, the time complexity will be O(Q·N·K).

DNA (100%)
Java HashTable

DNA (100%)
Key: substring
Value: number of occurrences of the substring
Iterate through the string once to populate the hash table: O(N·K)
Constant time for each query

DNA (100%)
ACGTAC
Store the substrings as keys: AC, CG, GT, TA.

DNA (100%)
We will have:
occur[AC] = 2
occur[CG] = 1
occur[GT] = 1
occur[TA] = 1

for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // count the substring of length K starting at index i
}

DNA (100%)
After we have built the table, we can answer a query in O(1) expected time by searching the hash table with the query as the key (hashing the key itself costs O(K), which is constant for small K).
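The table-building loop and the lookup above can be sketched in Java as follows. This is a minimal sketch, not the lab's reference solution: it assumes `java.util.HashMap` as the hash table, and the class and method names (`DnaHashTable`, `buildTable`, `query`) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class DnaHashTable {
    // Maps every length-k substring of text to its number of occurrences.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            String key = text.substring(i, i + k); // O(k) per position
            occur.merge(key, 1, Integer::sum);     // insert 1 or add 1 to the old count
        }
        return occur;
    }

    // A substring that never occurred is simply absent from the table.
    static int query(Map<String, Integer> occur, String pattern) {
        return occur.getOrDefault(pattern, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> occur = buildTable("ACGTAC", 2);
        System.out.println(query(occur, "AC")); // prints 2
        System.out.println(query(occur, "AG")); // prints 0
    }
}
```

The table is built once for a fixed K, so this sketch only answers queries whose length equals that K.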

Alternative
What if we do not have the Java Hash Table API?


DNA-V2
Implement our own hash table!
Since K is very small, we can use a simple hash function and an array as the table.

DNA-V2
Hash function?
First, we map A to 1, C to 2, G to 3, T to 4 (a DNA sequence contains only A, C, G, and T).

DNA-V2
ACGTAC
We only need to store the number related to the substring: AC = 12, CG = 23, GT = 34, TA = 41.

DNA-V2
We will have:
occur[12] = 2
occur[23] = 1
occur[34] = 1
occur[41] = 1

for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // count the substring of length K starting at index i
}

DNA (100%)
After we have built the table, we can answer a query in O(K) by calculating the hash value X of the substring in that query.
Output the value in occur[X].
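A possible Java sketch of this hand-rolled table is shown below. It follows the slides' digit mapping (A=1, C=2, G=3, T=4) and reads the K digits as a decimal number, so the array needs 10^K slots. That is only practical for small K, which is the assumption here. The names `DnaArrayHash`, `digit`, `hash`, and `buildTable` are illustrative, and `hash` takes the string explicitly, unlike the two-argument `hash(i, K)` on the slides.

```java
class DnaArrayHash {
    // Digit assigned to each nucleotide, as on the slides.
    static int digit(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a DNA character: " + c);
        }
    }

    // Hash of the length-k substring of s starting at index start,
    // e.g. "AC" -> 12 and "TA" -> 41.
    static int hash(String s, int start, int k) {
        int h = 0;
        for (int j = 0; j < k; j++) {
            h = h * 10 + digit(s.charAt(start + j));
        }
        return h;
    }

    // Direct-address table: occur[x] is the count of the substring whose hash is x.
    static int[] buildTable(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) size *= 10; // 10^k slots; assumes k is small
        int[] occur = new int[size];
        for (int i = 0; i + k <= text.length(); i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    public static void main(String[] args) {
        int[] occur = buildTable("ACGTAC", 2);
        int x = hash("AC", 0, 2);     // hash the query in O(K)
        System.out.println(occur[x]); // prints 2
    }
}
```

Because distinct substrings of the same length always get distinct numbers, no collision handling is needed. Many slots (any index containing a digit other than 1 to 4) are never used, which is the price of the simple decimal encoding.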

Problem 2: Find Substring

Find Substring
Given 2 strings, for each query substring:
Output 0: if the substring is in neither string
Output 1: if the substring is only in string 1
Output 2: if the substring is only in string 2
Output 3: if the substring is in both strings

Find Substring (70%)
Check the existence of the substring in both strings to determine the answer.
You might notice that this problem is very similar to the DNA problem: a substring is in a string if its number of occurrences is greater than 0.
It can be solved using the same technique as DNA (70%).
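As a rough illustration of the brute-force approach, the sketch below scans both strings for every query. It uses `String.contains` in place of the hand-written double loop from DNA (70%), which is a substitution made here for brevity, and the names `FindSubstringBrute` and `classify` are hypothetical. It relies on the observation that the four outputs are just 1 (found in string 1) plus 2 (found in string 2).

```java
class FindSubstringBrute {
    // Returns 0, 1, 2, or 3 according to which of the two strings contain the query.
    static int classify(String s1, String s2, String query) {
        int answer = 0;
        if (s1.contains(query)) answer += 1; // present in string 1
        if (s2.contains(query)) answer += 2; // present in string 2
        return answer;
    }

    public static void main(String[] args) {
        System.out.println(classify("ACGTAC", "ACTGCA", "AC")); // prints 3
        System.out.println(classify("ACGTAC", "ACTGCA", "CG")); // prints 1
    }
}
```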

Find Substring (100%)
It is possible to reuse the solution for DNA.
If the number of occurrences of a substring in a given string is greater than 0, it means that we can find the substring in the string.
You need 2 tables, one for the first string and another one for the second string.

Find Substring (100%)
For example, we have 2 strings, i.e. ACGTAC and ACTGCA.
Use the same technique as the one in DNA.

Find Substring (100%)
After we have built the tables, we can answer a query in O(1).
E.g. check occurOne.get("AC") and occur2.get("AC")
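The two-table idea could look like the sketch below. It assumes, as in the DNA problem, that all queries have the same known length K, and it uses `java.util.HashMap`. The table names `occurOne` and `occur2` follow the slide, while `FindSubstringTables`, `buildTable`, and `classify` are illustrative names.

```java
import java.util.HashMap;
import java.util.Map;

class FindSubstringTables {
    // Same table as in DNA (100%): length-k substring -> number of occurrences.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // A substring is in a string exactly when its count in that string's table is > 0.
    static int classify(Map<String, Integer> occurOne, Map<String, Integer> occur2, String query) {
        int answer = 0;
        if (occurOne.getOrDefault(query, 0) > 0) answer += 1;
        if (occur2.getOrDefault(query, 0) > 0) answer += 2;
        return answer;
    }

    public static void main(String[] args) {
        Map<String, Integer> occurOne = buildTable("ACGTAC", 2);
        Map<String, Integer> occur2 = buildTable("ACTGCA", 2);
        System.out.println(classify(occurOne, occur2, "AC")); // prints 3
        System.out.println(classify(occurOne, occur2, "TG")); // prints 2
    }
}
```

`getOrDefault` is used instead of a bare `get` so that a substring missing from a table yields 0 rather than `null`.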

Incantation-E

Task
Find an interval (continuous section) that
◦ Contains all incantations
◦ Has minimal total length

{acer, wei, wei, acer, acer, jing, acer, wei}

Idea
{acer, wei, wei, acer, acer, jing, acer, wei}
Maintain the interval using a queue
◦ Step 1: Initially empty
  {[] acer, wei, wei, acer, acer, jing, acer, wei}
◦ Step 2: While the queue does not contain all words, add words at the back of the queue
  {[acer, wei, wei, acer, acer, jing], acer, wei}

Idea
◦ Step 3: While the front of the queue is redundant, pop it out, and update the minimum total length
  {acer, wei, [wei, acer, acer, jing], acer, wei}, min = 15
◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue, and go to Step 3
◦ Final Answer: {acer, wei, wei, acer, acer, [jing, acer, wei]}, min = 11

Time Complexity: O(N).

How to check whether the first word in the queue is redundant?
◦ Use hashing to store each word's number of occurrences in the queue. The front word is redundant if it occurs more than once in the queue.
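Steps 1 to 4 can be sketched in Java as below. Two assumptions are made here: "all incantations" means every distinct word in the list, and "total length" is the sum of the word lengths (which matches min = 15 and min = 11 above). The sketch also pops redundant front words eagerly after every insertion instead of first waiting for the queue to contain all words. The names `Incantation` and `minTotalLength` are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Incantation {
    // Minimum total character length of a contiguous interval containing every distinct word.
    static int minTotalLength(String[] words) {
        if (words.length == 0) return 0;

        Set<String> distinct = new HashSet<>();
        for (String w : words) distinct.add(w);
        int needed = distinct.size();

        Deque<String> queue = new ArrayDeque<>();       // the current interval
        Map<String, Integer> inQueue = new HashMap<>(); // word -> occurrences in the queue
        int totalLength = 0;
        int best = Integer.MAX_VALUE;

        for (String w : words) {
            // Steps 2 and 4: add the next word at the back of the queue.
            queue.addLast(w);
            inQueue.merge(w, 1, Integer::sum);
            totalLength += w.length();

            // Step 3: the front is redundant if another copy of it is still in the queue.
            while (inQueue.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                inQueue.merge(front, -1, Integer::sum);
                totalLength -= front.length();
            }

            // Counts never drop to 0, so the map size is the number of distinct words queued.
            if (inQueue.size() == needed) {
                best = Math.min(best, totalLength);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"acer", "wei", "wei", "acer", "acer", "jing", "acer", "wei"};
        System.out.println(minTotalLength(words)); // prints 11
    }
}
```

Each word is added once and removed at most once, and each hash table update takes expected constant time, which is where the O(N) bound comes from.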
