Data science automation consulting seattle


Page 1: Data science automation consulting seattle

Quick Tip: Data science is more than just algorithms and data cleansing. It is about creating systems that can replicate your findings.

Good business practices are key: version control, good documentation, and processes can save a team hundreds of hours. They also reduce the probability that a data science project fails!

Good Luck!

DATA PROCESSING AND AUTOMATION

Automation is key to a data scientist's success. There is never enough time to manually carry out all the best practices required to constantly ensure high-quality data science solutions. Luckily, most of these processes are repetitive and already have a lot of best practices surrounding them. For instance, from data warehousing we have ETLs and QA suites. Although they will require some manual intervention and planning up front, they can and should eventually be set up in Task Scheduler or crontab (or another job scheduler) and only checked periodically.

"I COULDN'T TELL YOU IN ANY DETAIL HOW MY COMPUTER WORKS. I USE IT WITH A LAYER OF AUTOMATION." - CONRAD WOLFRAM

Data science also has the repetitive task of analyzing and classifying basic correlations and data features. Most of this requires the same basic algorithms and graphs and shouldn't be a manually intensive process. Otherwise, the exploration phase may take months.

Page 2: Data science automation consulting seattle

Data Acquisition

Open source data and company data silos have become far more plentiful over the past decade. This has allowed companies to take advantage of government data APIs, social media data, and similar sources. It also means that data scientists have the opportunity to search for meaningful relationships in all sorts of data sets.

Data Quality

Good data quality means a data scientist can spend less time cleaning data and more time seeking value. It is also worth auditing your data, either with internal teams or by hiring outside consultants.

Data Scalability

Data scientists can develop solutions that take many forms: a dashboard, an algorithm, and so on. However, one concept data scientists do not always think about is data scalability.

Will the data scale? Does the data require manual classification? If so, your system had better classify rows and data features automatically.

ETL Automation

Using scripting languages, SSIS, or other ETL tools, data science teams should limit manual imports to save 5 to 30 hours a week.
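As a rough illustration of a scripted import replacing a manual one, here is a minimal sketch using only the Python standard library. The table name `orders`, the columns `order_id` and `amount`, and the function names are all hypothetical and chosen for the example; a real pipeline would follow the team's own schema and tooling.

```python
import csv
import io
import sqlite3


def extract(csv_file):
    """Read rows from a CSV file-like object as dictionaries."""
    return list(csv.DictReader(csv_file))


def transform(rows):
    """Trim whitespace, cast amounts to float, and skip rows with no amount."""
    cleaned = []
    for row in rows:
        amount = (row.get("amount") or "").strip()
        if not amount:
            continue  # missing value: skipped here, worth logging in practice
        cleaned.append((row["order_id"].strip(), float(amount)))
    return cleaned


def load(conn, records):
    """Insert records, replacing rows with the same order_id so reruns are safe."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?)", records)
    conn.commit()


def run_etl(csv_file, conn):
    """Run one extract-transform-load pass and return the number of rows loaded."""
    records = transform(extract(csv_file))
    load(conn, records)
    return len(records)


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real export file.
    sample = io.StringIO("order_id,amount\nA1, 10.50\nA2,\nA3,7\n")
    connection = sqlite3.connect(":memory:")
    print(run_etl(sample, connection))
```

Because the load step replaces rows by key, the script can be rerun without creating duplicates, which is what makes it reasonable to hand over to a job scheduler. A crontab entry along the lines of `0 2 * * * python3 /path/to/etl.py` (the path is a placeholder) would run it nightly.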

QA Automation

Consider creating a test suite to automate upper and lower bounds testing, re-slicing and dicing the same data, basic aggregation testing, and tracking past data metrics.
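The checks listed above could be sketched roughly as follows. This is only an illustration under assumed names: `check_bounds`, `check_total`, `check_against_history`, and the 50% change threshold are all hypothetical, and a real suite would likely live in a test framework with thresholds agreed by the team.

```python
def check_bounds(values, lower, upper):
    """Return the values that fall outside the inclusive [lower, upper] range."""
    return [v for v in values if not (lower <= v <= upper)]


def check_total(rows, group_key, value_key, expected_total, tolerance=1e-6):
    """Re-slice rows by group and compare the summed subtotals to an expected total.

    The expected total should come from an independent source, such as the
    upstream system's own summary figure.
    """
    subtotals = {}
    for row in rows:
        group = row[group_key]
        subtotals[group] = subtotals.get(group, 0.0) + row[value_key]
    matches = abs(sum(subtotals.values()) - expected_total) <= tolerance
    return matches, subtotals


def check_against_history(current, history, max_relative_change=0.5):
    """Return True if a metric stays within max_relative_change of the mean of past runs."""
    baseline = sum(history) / len(history)
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) <= max_relative_change


if __name__ == "__main__":
    # Hypothetical sample rows for illustration only.
    rows = [
        {"region": "west", "sales": 10.0},
        {"region": "east", "sales": 5.0},
        {"region": "west", "sales": 20.0},
    ]
    print(check_bounds([r["sales"] for r in rows], 0, 15))
    print(check_total(rows, "region", "sales", 35.0))
    print(check_against_history(35.0, [30.0, 34.0, 32.0]))
```

Each function returns data rather than raising, so a scheduled job can collect all failures from one run and report them together.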

Analysis Automation

The early steps in the discovery and analysis stages of data science are pretty similar from project to project. They involve using basic clustering algorithms, histograms, and scripts to help detect bias, correlations, and quirks inside the data.
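To give a sense of how these first-pass checks might be scripted once and reused, the sketch below computes equal-width histograms and flags strongly correlated column pairs using only the standard library. It covers histograms and correlations only, not clustering. The function names and the 0.8 threshold are assumptions made for the example.

```python
import math
from itertools import combinations


def histogram(values, bins=5):
    """Count values into equal-width bins spanning min(values) to max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # top edge goes in the last bin
        counts[index] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, reported as 0 here
    return cov / math.sqrt(var_x * var_y)


def correlation_report(columns, threshold=0.8):
    """Return (name_a, name_b, r) for each column pair with |r| >= threshold."""
    report = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            report.append((name_a, name_b, round(r, 3)))
    return report


if __name__ == "__main__":
    # Hypothetical columns for illustration only.
    columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [5, 1, 4, 2]}
    print(histogram(columns["x"], bins=2))
    print(correlation_report(columns))
```

Running the same report against every new data set gives a consistent starting point for exploration, so manual effort can go to the pairs and distributions that look unusual.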

"Data! data! data!" he cried impatiently. "I can't make bricks without clay."

- Arthur Conan Doyle

Data Processing

Data requires several preparation steps in order to become useful to a data scientist. Below is a diagram that depicts data acquisition from multiple sources, data transformation, QA, and analysis. The key is to ensure your processes are both automatic and scalable. We have come across many data sets that make us cringe. Duplicate processes that create the same data that later has to be merged, missing data, and a lack of QA and auditing all make it difficult to follow data flows. It can be a fun challenge! However, we don't recommend it.
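The duplicate and missing-data problems described above are the kind of thing a small audit script can surface before analysis starts. The following is a minimal sketch; the function name `audit_rows` and the sample fields `id` and `email` are hypothetical, and it treats `None` and empty strings as missing, which may not suit every data set.

```python
from collections import Counter


def audit_rows(rows, key_fields, required_fields):
    """Summarize duplicate keys and missing required values in a list of dict rows."""
    key_counts = Counter(tuple(row.get(f) for f in key_fields) for row in rows)
    duplicates = {key: n for key, n in key_counts.items() if n > 1}

    missing = Counter()
    for row in rows:
        for field in required_fields:
            if row.get(field) in (None, ""):  # absent, null, or empty string
                missing[field] += 1

    return {
        "row_count": len(rows),
        "duplicate_keys": duplicates,
        "missing_values": dict(missing),
    }


if __name__ == "__main__":
    # Hypothetical sample rows for illustration only.
    rows = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": ""},
        {"id": 1, "email": "a@example.com"},
        {"id": 3},
    ]
    print(audit_rows(rows, key_fields=["id"], required_fields=["email"]))
```

Storing the returned summary after each run also provides a simple audit trail for tracking how data quality changes over time.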