Lab 6 Problem 1: DNA. DNA Given a string with length N, determine the number of occurrences of some...
Lab 6
Problem 1: DNA
DNA
Given a string of length N, determine the number of occurrences of some given substrings (each of length K) in that string.
For instance:
String: ACGTAC (N = 6)
Substring: AC (K = 2)
Answer: AC occurs 2 times in the string ACGTAC.
DNA
A substring is a consecutive part of a string.
Note that AG is not a substring of ACGTAC.
Brute-force Algorithm
For each query, iterate through the entire string.
For each position in the string, check the substring of length K starting there, and increment the count if it matches the query.
DNA (70%)
for (int i = 0; i + K <= N; i++) { // only start positions where a full window of length K fits
    boolean found = true;
    for (int j = 0; j < K; j++) {
        if (text[i + j] != pattern[j]) { // character mismatch
            found = false;
            break;
        }
    }
    if (found) counter++;
}
DNA (70%)
We can answer one query in O(N.K).
Hence with Q queries, the time complexity will be O(Q.N.K).
Solution: For every query and every index i, we check the substring with length K starting at index i.
DNA (100%)
Java HashTable

DNA (100%)
Key: substring
Value: number of occurrences of the substring
Iterate through the string once to populate the hash table: O(NK)
Constant time for each query
DNA (100%)
ACGTAC
Store the substrings as keys: AC, CG, GT, TA.
DNA (100%)
We will have:
occur[AC] = 2
occur[CG] = 1
occur[GT] = 1
occur[TA] = 1
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}
DNA (100%)
After we have built the table, we can answer a query in O(1) by searching the hash table with the query as the key.
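The hash table approach above can be sketched in Java roughly as follows. This is a minimal sketch, not the lab's reference solution: it uses `java.util.HashMap` in place of the older `Hashtable` class, and the class and method names (`DnaCounter`, `buildTable`, `count`) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative, not from the lab handout.
class DnaCounter {
    // Map every length-k substring of text to its number of occurrences.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the stored count
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // Substrings that never occur are simply absent from the table, so default to 0.
    static int count(Map<String, Integer> occur, String query) {
        return occur.getOrDefault(query, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> occur = buildTable("ACGTAC", 2);
        System.out.println(count(occur, "AC")); // prints 2
        System.out.println(count(occur, "AG")); // prints 0
    }
}
```

Building the table costs O(NK) because each of the N - K + 1 substrings takes O(K) to extract and hash. Each lookup also hashes the query, so "O(1)" here treats K as a small constant.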
Alternative
What if we do not have the Java Hashtable API?
DNA-V2
Implement our own hash table!
Since K is very small, we can use a simple hash function and an array as the table.
DNA-V2
Hash function?
First, we map A to 1, C to 2, G to 3, and T to 4 (a DNA sequence contains only A, C, G, and T).
DNA-V2
ACGTAC
We only need to store the number that corresponds to each substring: AC = 12, CG = 23, GT = 34, TA = 41.
DNA-V2
We will have:
occur[12] = 2
occur[23] = 1
occur[34] = 1
occur[41] = 1
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}
DNA (100%)
After we have built the table, we can answer a query in O(K) by calculating the hash value X of the substring in that query.
Output the value in occur[X].
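The array-based version can be sketched as below. It follows the digit mapping from the slides (A=1, C=2, G=3, T=4) and reads the K digits as a decimal number, so the table needs 10^K slots. The class and method names are hypothetical, and the sketch assumes K is small enough for that array to fit in memory and that every query has exactly length K.

```java
// Hypothetical helper class; names are illustrative, not from the lab handout.
class DnaArrayTable {
    // Digit for each base, as on the slides: A=1, C=2, G=3, T=4.
    static int digit(char base) {
        switch (base) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a DNA base: " + base);
        }
    }

    // Decimal encoding of the k characters of s starting at start, e.g. "AC" -> 12.
    static int hash(String s, int start, int k) {
        int h = 0;
        for (int j = 0; j < k; j++) {
            h = h * 10 + digit(s.charAt(start + j));
        }
        return h;
    }

    // occur[x] = number of length-k substrings of text whose encoding is x.
    static int[] buildTable(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) size *= 10; // 10^k slots (assumes small k)
        int[] occur = new int[size];
        for (int i = 0; i + k <= text.length(); i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    // Assumes query.length() equals the k used to build the table.
    static int count(int[] occur, String query) {
        return occur[hash(query, 0, query.length())];
    }
}
```

Because every digit is nonzero and all keys have the same length, two different substrings never share an encoding, so no collision handling is needed. A base-4 encoding would shrink the table to 4^K slots at the cost of no longer matching the numbers shown on the slides.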
Problem 2: Find Substring
Find Substring
Given 2 strings and a substring:
Output 0 if the substring is in neither string 1 nor string 2
Output 1 if the substring is only in string 1
Output 2 if the substring is only in string 2
Output 3 if the substring is in both string 1 and string 2
Find Substring (70%)
Check the existence of a substring in both strings to determine the answer.
You might notice that this problem is very similar to the DNA problem, i.e. a substring is in a string if its number of occurrences is greater than 0.
It can be solved using the same technique as DNA (70%).
Find Substring (100%)
It is possible to reuse the solution for DNA.
If the number of occurrences of a substring in a given string is greater than 0, the substring can be found in that string.
You need 2 tables, one for the first string and another one for the second string.
Find Substring (100%)
For example, we have 2 strings, i.e. ACGTAC and ACTGCA.
Use the same technique as the one in DNA.
Find Substring (100%)
After we have built the tables, we can answer a query in O(1).
E.g. check occur1.get("AC") and occur2.get("AC").
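The two-table idea can be sketched as follows. Since only presence matters here, this sketch swaps the occurrence-count tables for two `HashSet`s, which is a substitution and not what the slides show. It also assumes, as in the DNA problem, that all queries share one fixed length k. The names `FindSubstring`, `substrings`, and `answer` are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; names are illustrative, not from the lab handout.
class FindSubstring {
    // Collect every distinct length-k substring of text.
    static Set<String> substrings(String text, int k) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + k <= text.length(); i++) {
            seen.add(text.substring(i, i + k));
        }
        return seen;
    }

    // 0: in neither string, 1: only in string 1, 2: only in string 2, 3: in both.
    static int answer(Set<String> in1, Set<String> in2, String query) {
        int result = 0;
        if (in1.contains(query)) result += 1;
        if (in2.contains(query)) result += 2;
        return result;
    }

    public static void main(String[] args) {
        Set<String> in1 = substrings("ACGTAC", 2);
        Set<String> in2 = substrings("ACTGCA", 2);
        System.out.println(answer(in1, in2, "AC")); // prints 3
        System.out.println(answer(in1, in2, "CG")); // prints 1
    }
}
```

Adding 1 for the first string and 2 for the second produces exactly the four required outputs without a chain of nested conditions.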
Incantation-E
Task
Find an interval (continuous section) that:
◦ Contains all incantations
◦ Has minimal total length
{acer, wei, wei, acer, acer, jing, acer, wei}
Idea
{acer, wei, wei, acer, acer, jing, acer, wei}
Maintain the interval using a queue.
◦ Step 1: Initially the queue is empty: {[] acer, wei, wei, acer, acer, jing, acer, wei}
◦ Step 2: While the queue does not contain all words, add words at the back of the queue: {[acer, wei, wei, acer, acer, jing], acer, wei}
Idea
◦ Step 3: While the front of the queue is redundant, pop it, then update the minimum total length: {acer, wei, [wei, acer, acer, jing], acer, wei}, min = 15
◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue and go to Step 3.
◦ Final answer: {acer, wei, wei, acer, acer, [jing, acer, wei]}, min = 11
Time complexity: O(N).
How to check whether the first word in the queue is redundant?
◦ Use hashing to store each word's number of occurrences in the queue. The front word is redundant if its count is greater than 1.
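The queue-based steps can be sketched in Java as below. This is one possible variant, not necessarily the intended solution: it pops redundant front words after every insertion instead of first waiting until the queue contains all words, which gives the same minimum. It assumes "total length" means the sum of the word lengths in the interval, which matches min = 15 and min = 11 in the example. The names `Incantation` and `minTotalLength` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical helper class; names are illustrative, not from the lab handout.
class Incantation {
    // Smallest sum of word lengths over all intervals containing every distinct word.
    static int minTotalLength(String[] words) {
        if (words.length == 0) return 0;
        int distinct = new HashSet<>(Arrays.asList(words)).size();
        Map<String, Integer> inQueue = new HashMap<>(); // word -> occurrences in the queue
        ArrayDeque<String> queue = new ArrayDeque<>();
        int total = 0;                // sum of lengths of the words in the queue
        int best = Integer.MAX_VALUE;
        for (String w : words) {
            // Add the next word at the back of the queue.
            queue.addLast(w);
            inQueue.merge(w, 1, Integer::sum);
            total += w.length();
            // The front word is redundant if it occurs again later in the queue.
            while (inQueue.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                inQueue.merge(front, -1, Integer::sum);
                total -= front.length();
            }
            // Counts never drop to 0, so the map size is the number of distinct words queued.
            if (inQueue.size() == distinct) best = Math.min(best, total);
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"acer", "wei", "wei", "acer", "acer", "jing", "acer", "wei"};
        System.out.println(minTotalLength(words)); // prints 11
    }
}
```

Each word is added to the queue once and removed at most once, so the loop does O(N) queue and hash operations in total.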