Data science and Longest Common subsequence Python
I need support with this Computer Science question so I can learn better.
I have attached the .txt file, the function to import the .txt file, and the recursive longest common subsequence code.
Q2) (20 points) Data Science is one of the most popular areas where Python is widely used. In this question, you will have an opportunity to put your tiny tiny first baby step into this territory. You will also develop your own simple hash table using the existing standard Python data types.
- Modify the LCS recursive version to count the number of recursive calls for each 6-digit integer string against “0123456789” string. The function returns a tuple of two elements (LCS num, the number of recursive calls to find the LCS)
- I think I found a pretty good (but very slow) hash function of integer strings, which is the recursive LCS function. 🙂 Use this function as your hash function to store the integer strings you read in Q1) into your own hash table. Please do not use Python dict() data type directly. You should develop your own hash table where the keys are the number of recursive calls as computed from the recursive LCS function.
- Now, regarding how good the new hash function is in terms of generating keys uniformly, we want to check it by counting the number of collisions for each and every key in the hash table from 1M integers. It would be very ideal if the average collision number is close to 100 for 10,000 buckets out of 1M numbers. We can get an idea of it by computing the average number of collisions, but we may also visualize the distribution of the collisions across all the keys using Python plot library.
- Use a plotting library (https://plot.ly/python/ (Links to an external site.)) to visualize the distribution of key collision of the LCS hash function:
import matplotlib.pyplot as plt
from plotly.graph_objs import Bar, Layout
from plotly import offline
# or any other library of your choice such as panda, if you prefer
- Draw an offline bar graph to show the relationship between hash keys (the number of recursive calls to compute the LCS against “0123456789” ) and the number of collisions for each hash key. You may refer to the chart that I came up with using python dict() class (I could have used collections. Counter class for that matter) but again you should use your own hash table for this homework which maps each key to the corresponding list of collisions. In any case, the final chart should look very similar, if not identical. <<
5. From my personal experience, I firmly believe that no meaningful data analysis can be completed without interjecting human intuition in the data interpretation process. Let’s exercise our own. Look at my chart mentioned above. It’s not perfect but it appears fairly uniformly distributed across the keys (LCS numbers). But it you look at it closely, after passing certain threshold, the collision rate tapering off, which means that bigger the number of recursion gets, smaller the collisions (frequencies of hash keys generated in the range) gets. They are thinly spread out. Also note that these numbers would take much longer time to get a hash key using the LCS hash function because it makes many more recursive calls to finish the recursion. So I would like to limit the number of recursive calls when it passes a certain threshold. With that in mind,
- using your intuition, from your own bar graph, pick up a threshold number and give your reasons explaining why you picked up the number. No right or wrong answer here.
- To handle those numbers that go over the threshold, you may develop and run a simple secondary hash function of your own choice to collapse those outliers to a shorter range so that the upper bound of entire hash table of yours is below or equal to 10,000.
- Then run your new algorithm to redraw the distribution across new set of keys added to the existing keys.
Since running LCS for the million numbers in rand1000000.txt file takes quite a long time, you may first test it with rand1000,txt file ( less than 10 seconds)
Q2 Deliverable :
- Python codes of modified LCS recursive and the driver program
- The offline html files of the hash table collision charts for both before and after modification