I thought I would touch on the tools of the trade today. Back in the day, when I started my journey, I used a package called STATISTICA to build models. In those days, I was working for one of my professors, building models for the state electricity provider, who needed to know whether winter would be especially cold (the colder the winter, the more power was required), and for the farmers, who needed to know whether the rains would come early or late in the season. I subsequently graduated and moved on to other things. For most of the past 25+ years I have used SAS to understand data and build models, although over the past 6 years or so I have used R and, primarily, Python.
I liken the tool I'm using to an envelope; some are fancy, some are plain, some are fluorescent pink, others are a super functional white, with a window - so that you can see what is going on inside. Regardless, they're all tools. There is another tool that I've used: SQL. SQL comes in many different flavours, but essentially, it is Structured Query Language. Don't misunderstand me (and all the DBAs, please don't scream at once!): SQL is very good at what it does. It is primarily used to manage relational databases, and it does that extremely well. Apparently, you can even do "real" analytics with SQL - I must admit, I never have. I have used SQL to extract the data required from a DB and then started working in my tool of preference.
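For anyone who hasn't seen that extract-then-analyse pattern, here is a minimal sketch of roughly what it can look like in Python. It assumes Python's built-in `sqlite3` module as a stand-in for a real database; the `daily_load` table, its columns and the handful of numbers in it are all made up purely for illustration.

```python
import sqlite3
from statistics import mean

# Hypothetical in-memory database standing in for a real DB.
# The table name, columns and values below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_load (day TEXT, season TEXT, temp_c REAL, demand_mw REAL)"
)
rows = [
    ("2024-01-10", "summer", 27.0, 410.0),
    ("2024-01-11", "summer", 29.0, 400.0),
    ("2024-07-10", "winter", 6.0, 620.0),
    ("2024-07-11", "winter", 4.0, 660.0),
]
conn.executemany("INSERT INTO daily_load VALUES (?, ?, ?, ?)", rows)

# Step 1: let SQL do what it is good at, pulling out only the rows needed.
winter = conn.execute(
    "SELECT temp_c, demand_mw FROM daily_load WHERE season = ? ORDER BY day",
    ("winter",),
).fetchall()
conn.close()

# Step 2: carry on in the tool of preference (plain Python here).
temps = [temp for temp, _ in winter]
demand = [mw for _, mw in winter]
avg_winter_demand = mean(demand)

print(f"Winter days extracted: {len(winter)}")
print(f"Average winter demand (MW): {avg_winter_demand}")
```

The same two steps apply whatever the envelope: the query stays in SQL, and everything after `fetchall()` could just as easily be R or SAS.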
Data Science is a maturing field. Not only are the problems analysts grapple with becoming increasingly complex, but expectations of data science capabilities are increasing, and will continue to do so. This means that it is essential that your analytics teams have the requisite tools and skills to solve the types of challenges your organisation faces. Historically, proprietary software, such as SAS or SPSS, was the way to go. However, as the landscape has changed, so have the tools. In today's Data Analysis and Data Science world, SAS, R and Python are the three most popular tools. R and Python are both open-source and both have large communities online. There are two key benefits to open-source technology:
- The cost barrier to entry is extremely low; it’s basically non-existent.
- The community contributes to making the software better because the source code is available.
For companies looking to develop in-house teams, Python or R might be the way to go. However, for companies that already have an incumbent data science solution, this may be trickier, particularly if internal resources are not familiar with R or Python, as this may mean a lag while the team gets up to speed.
To my way of thinking, the envelope (the data science solution) is only part of the puzzle. Provided it meets expectations, is it really important whether it is bright pink or white with a window?