Data science automation consulting Seattle
Uploaded by: benjamin-rogojan
Category: Data & Analytics
Quick Tip: Data science is more than just algorithms and data cleansing. It is about creating systems that can replicate your findings. Good business practices are key: version control, good documentation, and sound processes can save a team hundreds of hours. They also reduce the probability that a data science project fails! Good Luck!
AUTOMATION AND DATA PROCESSING
"I COULDN'T TELL YOU IN ANY DETAIL HOW MY COMPUTER WORKS. I USE IT WITH A LAYER OF AUTOMATION."
- CONRAD WOLFRAM
Automation is a key to a data scientist's success. There is never enough time to manually carry out all the best practices required to consistently ensure high-quality data science solutions. Luckily, most of these processes are repetitive and already have a lot of best practices surrounding them. For instance, from data warehousing we have ETLs and QA suites. Although they require some manual intervention and planning up front, they can and should eventually be set up in Windows Task Scheduler or crontab (or another job scheduler) and only checked periodically.
Data science also has the repetitive task of analyzing and classifying basic correlations and data features. Most of this requires the same basic algorithms and graphs and shouldn't be a heavily manual process. Otherwise, the exploration phase may take months.
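As a rough sketch of the "schedule it and check it periodically" idea above, the following Python entry point runs each pipeline step in order and records the outcome in a log. The step functions `extract` and `run_qa`, the script path in the crontab comment, and the log file name are hypothetical placeholders, not part of any particular tool.

```python
import logging
from typing import Callable, List, Tuple

# Example crontab entry (hypothetical path), running every day at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/pipeline/run_pipeline.py


def extract() -> None:
    """Placeholder for a real import step."""


def run_qa() -> None:
    """Placeholder for a real QA suite."""


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run steps in order; return the names of the steps that failed."""
    failed = []
    for name, step in steps:
        try:
            step()
            logging.info("step %s succeeded", name)
        except Exception:
            # Log the traceback and continue with the remaining steps.
            logging.exception("step %s failed", name)
            failed.append(name)
    return failed


def main() -> int:
    """Configure logging, run the pipeline, and return an exit code."""
    logging.basicConfig(filename="pipeline.log", level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    failures = run_steps([("extract", extract), ("qa", run_qa)])
    # A non-zero exit code lets the scheduler flag the run as failed.
    return 1 if failures else 0

# When saved as a script, finish the file with: raise SystemExit(main())
```

Continuing after a failed step is a design choice made here so that one broken step does not hide problems in the others; a pipeline whose steps depend on each other may prefer to stop at the first failure.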
Data Acquisition
Open-source data and company data silos have become more plentiful over the past decade. This has allowed companies to take advantage of government data APIs, social media data, and more. It also means that data scientists have the opportunity to search for meaningful relationships in all sorts of data sets.
Data Quality
Good data quality means a data scientist can spend less time cleaning data and more time seeking value. It is also beneficial to audit your data, either with internal teams or by hiring outside consultants.
Data Scalability
Data scientists can develop solutions that take many forms: a dashboard, an algorithm, and so on. However, one concept data scientists do not always think about is data scalability. Will the data scale? Does the data require manual classification? If so, your system had better classify rows and data features automatically.
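To make the point about automatic classification concrete, here is a minimal rule-based sketch. The categories and keywords in `RULES` are invented for illustration; a real system might well use a trained model instead. The idea shown is only that the machine labels what it can and leaves the remainder for a person.

```python
from typing import Dict, List

# Hypothetical keyword rules, invented for this example.
RULES: Dict[str, List[str]] = {
    "billing": ["invoice", "refund", "charge"],
    "shipping": ["delivery", "tracking", "package"],
}


def classify(text: str, rules: Dict[str, List[str]] = RULES) -> str:
    """Return the first category whose keyword appears in the text."""
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "needs_review"  # only unmatched rows go to a human


rows = ["Refund for double charge", "Where is my package?", "Hello there"]
labels = [classify(row) for row in rows]
```

Rows that match no rule are labeled `needs_review`, so manual effort grows with the number of unusual rows rather than with the size of the data set.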
ETL Automation
Using scripting languages, SSIS, or other ETL tools, data science teams should limit manual imports, which can save 5-30 hours a week.
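A scripted import can be quite small. The sketch below, using only the Python standard library, extracts rows from a CSV file object, applies a light transformation, and loads them into SQLite. The table name `sales` and the columns `region` and `amount` are assumptions made for the example, and the in-memory file and database stand in for real ones.

```python
import csv
import io
import sqlite3


def load_csv(conn: sqlite3.Connection, csv_file, table: str = "sales") -> int:
    """Extract rows from a CSV file object, clean them, and load them into SQLite."""
    reader = csv.DictReader(csv_file)
    # Transform: trim whitespace and convert the amount to a number.
    rows = [(r["region"].strip(), float(r["amount"])) for r in reader]
    # Table names cannot be bound as SQL parameters, so `table` must come
    # from trusted code, never from user input.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (region TEXT, amount REAL)")
    conn.executemany(f"INSERT INTO {table} (region, amount) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Usage with in-memory stand-ins for a real file and database.
sample = io.StringIO("region,amount\n west ,10.5\neast,4\n")
conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, sample)
```

Once an import looks like this, it can be handed to a job scheduler instead of being repeated by hand.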
QA Automation
Consider creating a test suite to automate upper- and lower-bound testing, re-slicing and dicing the same data, basic aggregation testing, and tracking of past data metrics.
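The checks listed above might look roughly like the following. The function names, the tolerance, and the 50% drift threshold are illustrative assumptions, not a standard; real thresholds depend on the data.

```python
from typing import Dict, List


def check_bounds(values: List[float], lower: float, upper: float) -> List[float]:
    """Return the values that fall outside [lower, upper]."""
    return [v for v in values if not lower <= v <= upper]


def check_aggregate(parts: Dict[str, float], reported_total: float,
                    tolerance: float = 1e-6) -> bool:
    """Re-slice check: per-group totals should add up to the overall total."""
    return abs(sum(parts.values()) - reported_total) <= tolerance


def check_drift(current: float, history: List[float],
                max_ratio: float = 0.5) -> bool:
    """Return True when a metric stays within max_ratio of its historical mean."""
    baseline = sum(history) / len(history)
    return abs(current - baseline) <= max_ratio * abs(baseline)


amounts = [12.0, 15.5, -3.0, 980.0]
out_of_range = check_bounds(amounts, lower=0, upper=500)
```

Each function returns a plain value, so the same checks can be wrapped in a test framework or called from a scheduled job.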
Analysis Automation
The early steps in the discovery and analysis stages of data science are pretty similar from project to project. They involve using basic clustering algorithms, histograms, and scripts to help detect bias, correlations, and quirks inside the data.
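Two of the routine exploration steps mentioned above, histograms and correlations, can be scripted once and reused on every new data set. This is a standard-library sketch with made-up sample values; clustering is left out to keep it short.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List


def histogram(values: List[float], bin_width: float) -> Dict[float, int]:
    """Count values per bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length columns.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


ages = [21, 25, 34, 38, 52]          # made-up sample column
age_bins = histogram(ages, bin_width=10)
```

Running helpers like these over every numeric column gives a quick first look at skew and obvious relationships before any modeling starts.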
"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
- Arthur Conan Doyle
Data Processing
Data requires several preparation steps before it becomes useful to a data scientist. Below is a diagram that depicts data acquisition from multiple sources, data transformation, QA, and analysis. The key is to ensure your processes are both automated and scalable. We have come across many data sets that make us cringe: duplicate processes that create the same data, which later has to be merged; missing data; and a lack of QA and auditing all make it difficult to follow data flows. It can be a fun challenge! However, we don't recommend it.
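Two of the problems named above, duplicated records and missing data, are straightforward to audit automatically. The sketch below assumes each row is a dictionary and that one column (here the hypothetical `id`) should be unique; the sample rows are invented.

```python
from collections import Counter
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def audit(rows: List[Row], key: str) -> Dict[str, object]:
    """Report duplicated key values and per-column missing-value counts."""
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in key_counts.items()
                        if k is not None and n > 1)
    missing: Counter = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat both None and empty strings as missing.
            if value is None or value == "":
                missing[column] += 1
    return {"duplicate_keys": duplicates, "missing": dict(missing)}


sample_rows = [
    {"id": "1", "city": "Seattle"},
    {"id": "2", "city": ""},
    {"id": "1", "city": None},
]
report = audit(sample_rows, key="id")
```

Saving a report like this on every run also leaves a simple audit trail, which makes data flows easier to follow later.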