
SAS to R: Logging & Auditing in R Video

A video that compares SAS and R logging and demonstrates how we've created logging and auditing capabilities in R.



SAS to R

Exploration of a common pitfall in SAS to R conversions.



Downloadable PDF: SAS_to_R_Common_Pitfalls


SAS to R

We often get asked about best practices in performing SAS to R conversions.  Below is a presentation we’ve put together to help organizations navigate switching from SAS to R.


Download the PDF: SAS_to_R_Conversion


SAS to R Conversion Assessment Tool

Rconvert, a division of Boston Decision, is pleased to report that we’ve recently completed a tool to assist with SAS to R code conversion.

Based on extensive expertise with both SAS and R programming, the tool has been developed to apply a series of best practice assessments designed to gauge the ease of conversion and detect areas where caution should be taken to ensure a successful switch from SAS to R.

Using the tool, we have begun offering a rapid SAS to R assessment, free of charge.  Interested parties may contact us via the Rconvert.com website or by calling (617) 500-0093.



R for Electronic Medical Record (EMR / EHR) Analysis and Data Mining – Will R Take The Lead?

Healthcare organizations are on the move, working feverishly to implement Electronic Medical Record (EMR) and Electronic Health Record (EHR) systems as part of a federal “requirement” enacted by the American Recovery and Reinvestment Act of 2009.  This requirement forces healthcare organizations to implement and make effective use of electronic medical record systems by 2015, or risk having Medicare reimbursements reduced.  In the rush to implement such systems, little attention has been focused on what may be the greatest contribution to the healthcare field of our time – analysis and data mining of such medical records to detect, better treat, and ultimately prevent illness.  We believe that R, an open-source data analysis language, is best positioned to make such analysis possible.

In fact, we predict that electronic medical record vendors will soon be embedding or otherwise implementing R into their solutions.

While the benefits of electronic medical records versus a paper-based alternative have long been documented, fewer than 50% of US health organizations had adopted such technology by 2009.  Cost and a lack of standards have been two of the major reasons for the delay in adoption.  With the federal government creating financial incentives to ensure that such technology is adopted, we believe that costs and standards will no longer represent significant barriers to entry moving forward.  Electronic medical records will be adopted.  But then what…

Data mining and analysis of electronic medical record data is the next frontier.

While the frenzy to implement EMR has led to so much attention being paid to the practice of storing medical records, little time has been spent determining how such data can be fully leveraged to improve patient care and health systems as a whole.  In a previous article, we discussed some of the implications of electronic medical records in the insurance space.  We concluded that it would initially increase malpractice costs for physicians.  Medical mistakes are easier to catch when more information is being recorded about a patient’s treatment.  However, we ultimately find that adoption of such systems will have an enormous impact on improving patient care beyond the current paper vs. digital benefits boasted by EMR vendors.

We see the major benefits originating from data mining of EMR records.  For example, what if medical records were analyzed in real-time to create more personalized medicine?  What if we could quickly measure how patients of a similar background responded to various treatment options, then use that information to help treat a current patient?  What about predicting length of hospital stay using medical record information, enabling hospitals to staff and allocate resources more effectively?

We view R, an open-source data analysis language, as positioned to make this vision a reality.  We believe R is best positioned to analyze electronic medical records for the following eight reasons:

  1. Given that standards for EMR systems are still in flux, any solution to data mining of such records should be flexible and capable of adapting to shifting EMR standards. R is positioned well for this environment, as it already integrates and connects into a plethora of database management systems.
  2. The technology will need to be capable of analyzing very large amounts of data – millions to billions of records.  R enables parallel processing and can be used in conjunction with Hadoop and other technologies to spread analysis out to distributed hardware.
  3. As the EMR space is a rapidly growing field, the analytical technology that it’s paired with should also be on a growth trajectory.  Given R is open-source, new methods and techniques are implemented into R faster than proprietary alternatives.
  4. The analytical technology should work on many different operating systems in order to service the variety of hardware/software solutions used by healthcare organizations.  R fulfills this requirement and is cross-platform.  It works on Windows, Mac, and Unix.
  5. The analytical technology should have a large user base to support the needs of the healthcare space.  R has a large, international community that includes some of the brightest minds.  R is also taught in most of the top academic statistical programs across the US.
  6. The technology must be transparent.  Once again, R is open-source, enabling anyone to go in and understand what it is doing.  Also, R is very well-documented in the literature.
  7. The technology must have very strong support for unstructured data analysis, as much of EMR data is unstructured text.  R has a list of very powerful text mining and unstructured data analysis packages / libraries.
  8. The technology needs to be affordable. R satisfies this requirement;  R is free.
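Reason 1 above can be illustrated in a few lines of R.  Below is a minimal sketch of R's database connectivity using the DBI and RSQLite packages (both assumed to be installed; swapping the driver reaches MySQL, PostgreSQL, and other systems).  The table and column names are hypothetical, purely for illustration:

```r
library(DBI)

# Open an in-memory SQLite database as a stand-in for an EMR back-end.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A hypothetical, illustrative table of patient hospital stays.
dbWriteTable(con, "stays", data.frame(id = 1:3, stay_days = c(2, 5, 3)))

# Query it with ordinary SQL and get an R data frame back.
avg_stay <- dbGetQuery(con, "SELECT AVG(stay_days) AS avg_stay FROM stays")
print(avg_stay$avg_stay)

dbDisconnect(con)
```

Because the DBI interface is the same regardless of the back-end, code like this can adapt as EMR storage standards shift.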

Similar to the inevitability of EMR adoption by mainstream US healthcare, we view data mining of such records as the next surge.  The question is, who will be the leader in this space?  We believe it will be R.

For those interested in discussing this topic further, contact Timothy D’Auria at tdauria@bostondecision.com.


12 Reasons Users are Switching from SAS to R

According to a 2010 poll conducted by KDNuggets:

49.6% of current SAS users surveyed are considering a switch away from SAS.

Of the above, 32.8% are considering a switch to R.

Why switch to R?  Here are just a few reasons:

1. It’s completely free.  No licensing fees.  Ever!

2. Can handle big data (gigabytes of data and millions of records).

3. Fantastic Parallel Processing

4. R is cross-platform (Windows, Mac, Unix, etc.)

5. Database Integration (Oracle, MySQL, SQLite, Access, PostgreSQL, Microsoft SQL Server, etc.)

6. Top companies like Google, Facebook, and Pfizer use it.

7. R has more cutting-edge analytical approaches than any other language out there.

8. R is open-source.  Change it.  Deploy it to the web.  Your imagination is the limit.

9. Very large community – Get help fast.  Nearly all major universities have started teaching R in their classrooms.

10. R is stable.  Its predecessor, the S language, dates back to Bell Labs.

11. R is growing at breakneck speed – if you can dream it, someone is probably writing it in R.

12. R is the leading data mining tool, used by 43% of data miners according to the 2010 Rexer Annual Data Mining Survey.
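Points 3 and 4 above can be seen together in a few lines of code.  Below is a minimal sketch using the `parallel` package that ships with base R; the cluster approach shown here works on Windows, Mac, and Unix alike:

```r
library(parallel)

# Start two local worker processes.
cl <- makeCluster(2)

# Farm a simple computation out across the workers.
squares <- parLapply(cl, 1:8, function(x) x^2)

stopCluster(cl)
print(unlist(squares))
```

The same `parLapply` call scales from a laptop to a multi-core server simply by changing the worker count.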

BostonDecision.com (and our Rconvert.com division) has specific expertise in performing SAS to R conversions.  We can help guide interested firms through the pros, cons, risks, and benefits of converting.


R Usage Exceeds SAS, SPSS in Recent Kaggle Survey

In an article published this November, Kaggle reported survey results showing that R is the most frequently listed tool in user profiles for Kaggle predictive modeling competitions.  Of the 1,714 users who participated in the survey, 32% listed R as a tool they use.  Matlab finished a distant second at 13%.

Of the Kaggle users surveyed, inclusion of R in their list of tools was over 350% greater than inclusion of SAS.   Kaggle also reported that 50% of competition winners used R in their analysis.  Below are the survey results as published by Kaggle:

The above graph and the original story may be found at: http://blog.kaggle.com


SAS and R – Is SAS frightened of R?

In a January 2009 article from the NYTimes, the director for technology product marketing at SAS was quoted as saying of R, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

Unfortunately for this marketing director, I'm afraid that freeware is already being used to build aircraft engines.  I'm curious whether this SAS director has ever used Google Maps to arrive safely at a destination.  I wonder if she has ever used the internet, where the Apache web server dominates with over 60% market share as of 2011 (per Netcraft.com).  The closest proprietary competitor in the web server space is Microsoft, which sports a 20% market share.

This list of industries where freeware has overtaken commercial competitors is lengthy and beyond the scope of this article.  However, the above quote should stimulate thought within organizations about to purchase statistical computing software.  Why would a leading commercial statistical software vendor bring up price as a weakness of R (it’s free) rather than the technical merits of the technology?  Furthermore, since when does a product being open-source suddenly make it bad?  Don’t we as readers need to know more?

Having done extensive work with both SAS and R, we are in a unique position to comment on the field of commercial statistical software and how it compares to the open-source competition, namely R. Put bluntly, we believe that commercial statistical computing languages may be in trouble… but with some caveats.  With worthy, free competitors such as R entering the marketplace, the reasons for companies to pay for use of an analytical programming language have diminished.  If the open-source technology were less capable or less reliable, this story would be different.  However, our company has found that R works equally well (if not better) in cases where we would historically have used SAS and/or another commercial product.  For example, we've had the opportunity to develop and re-develop a marketing model for our clients, in both SAS and R.  The development time and outcome were about the same in each case, give or take a bit.  However, R was free to use.  Additionally, because R carries no licensing restrictions, we were able to get our R clients moving faster in exposing their models to the web, and we could put more users on the R model since we could freely install it on as many machines as we pleased.  This would not be possible with commercial offerings without paying a price.  Cheaper and faster with the same outcome is a good deal in our industry.

So, is there any hope for commercial statistical software?

We believe the answer is yes, but will require that these vendors adapt to changing times.  Commercial statistical software vendors, SAS in particular, are in a unique position having helped clients use their software to address business challenges for more than 30 years.  People don’t buy statistical software as the finished product.  Instead, people buy statistical software to develop solutions that address problems.  Who has better knowledge about the field of statistical solutions than the vendors who created the underlying technology?  No one – this is where commercial statistical computing vendors should focus and indeed where the market is headed.

We see analytical vendors who focus on solutions as being in a very good position going forward.  However, there is one caveat.  Focus will be key here.  But what do we mean by focus?  Vendors who offer thousands of different products to disparate markets will likely fall behind.  Most customers are not looking for vendors that can be everything to everyone.  Most customers are looking for the vendor that best understands and can solve their specific problem.  Vendors need to show discipline, restraint, and sharp resolution in their product offering.  Apple Inc has mastered this field.  While Apple has many other traits that enabled it to become one of the most successful technology companies ever to exist, Steve Jobs always had the restraint to only focus on a small set of problems that his company could become best at solving.

In summary, we see R and open-source technologies becoming the standard for statistical computing over the next ten years.  However, we believe that this shift has created new opportunities for proprietary statistical vendors, and that these vendors are best-positioned to embrace them – namely in the solutions market.  Seizing these opportunities will likely require strong discipline and focus to ensure customers are delivered crisp, sharp solutions that attack their problems head-on.


5 Reasons to Worry if You’re Using Proprietary Statistical Computing Software

Lately, the marketing engines of several large statistical software companies have been hard at work trying to spread fear and desperately convince businesses why they should not switch to R – a free, open-source statistical computing technology that is taking the financial, healthcare, insurance, and other industries by storm.

Given our experience with both proprietary technologies like SAS, and the open-source competitor R, we wanted to unveil the other side of the coin that the major software firms don’t want people to see.  Below are 5 reasons to be concerned if your company is using licensed data and statistical computing software.

1. If you miss a renewal bill payment, all your work may… stop working.

Picture spending 2, 5, or 10 years developing in a language that is licensed annually.  Now imagine that one day you log into your workstation, only to find that all of the programs you have written have suddenly stopped working!  This is a very real story and a risk companies face every day when paying to license a software language.  What happens if one day your company runs into troubled times and cannot afford to pay tens of thousands (perhaps hundreds of thousands) of dollars to renew the license?  You are pretty much stuck.

2. Unexpected Licensing Cost Jumps

When a company develops using a licensed statistical computing technology, particularly one that requires annual renewal fees for continued use, the company has forced itself into a position where it needs to continue to pay these licensing fees in order to continue using what it built.  I’ve seen many cases where paying these fees becomes a lifeline for a company.  Particularly in cases where there has been vast development on top of a licensed language, the company’s only way of surviving may be to pay these fees.  What would happen if this “survival fee” were to jump 10%, 20% or 60% in a given year (the latter is what Netflix did to its customers according to Bloomberg)?  Can you afford to risk your existence on uncertain, increasing future licensing costs?

On this point, I know some may argue that this happens all the time in business; for example, a company may license a database system.  However, there is a big difference here.  In the case of database platforms, servers, routers, word processing programs, and other technologies, the technology is not serving as a critical building block.  Sure, there is the cost of setup and training.  However, it would not be inconceivable for a company to replace one of these resources.  A statistical computing platform is not the same.  Once an organization begins developing in a particular language, that language becomes part of its DNA.  Everything that gets built and all the intellectual property that is brought into existence is now contingent on paying licensing fees.  Replacing an arm or a leg may be doable, but replacing a company's DNA would be a substantial challenge.

3. Mergers and Acquisitions

In the field of data technologies, vendors are changing all the time, and the seniority of a vendor is little assurance of its future stability.  For example, SPSS, a 40-year-old-plus statistical computing vendor whose software was initially released in 1968, was acquired in 2009 by IBM.  While this may not seem like a big deal, such changes can be particularly worrisome in the field of statistical computing.  If you own a car and the auto manufacturer were to be acquired, perhaps the greatest question a current owner might face is whether his warranty would still be honored.  However, the implications of such a merger in statistical computing are far more troubling.  If you develop in a particular computing language that requires licensing, there is no telling what level of support the new owner of your recently-acquired vendor might provide you – if any.  It's like building a house on a piece of land that is owned and traded by someone else.  The worst part is that you have no say in how the land is traded!

4. Bankruptcies

Major corporations that have shaped the world we live in are going out of business.  Boston Scientific and Eastman Kodak are just two examples of technology companies at risk of bankruptcy in the near future, according to Business Insider.  What would happen if the vendor of the statistical computing software at the root of our company's development were to find itself in financially troubling times?  Is there any guarantee that the engine that drives our company will still be able to run tomorrow?

5. Limits Business Expansion and Scalability

A major challenge I've encountered with clients who use proprietary statistical computing software is the difficulty they face when they want to put their developed products or services online and/or connect them with other technologies.  While the technology may be available to make this expansion possible, expect to pay an arm and a leg for it (possibly the one taken in item 2 above).  Particularly hard-hit are companies that wish to get paid to crunch other people's data.  Generally, this capability falls under special licensing provisions that may be so pricey as to make any CEO/CTO keel over.  I'd estimate that 30% of the clients I've worked with have run into this issue, causing them either to abandon their expansion plans or to use a different technology (such as R) to make them possible.

In summary, the next time you hear a statistical computing vendor talk down about the risks of open-source technologies like R,  be sure to consider what the vendor is not telling you.


R versus SAS – A Summary List

In recent discussion forums like the Advanced Business Analytics, Data Mining and Predictive Modeling group on LinkedIn, there has been much discussion on the pros and cons of each technology for statistical computing, particularly with respect to their applications in business.  Below is a summary of some of the points made across this discussion:

1. Data Set Size Limitations

One of the great myths in the field is that R is limited in the size of the data sets it can analyze compared to SAS.  This is FALSE.  In fact, similar to SAS, the size of the data sets that may be analyzed by R is limited only by the physical machine.

There is an important difference in how SAS and R natively handle data.  Simply stated, the size of data sets analyzed in SAS is generally bottlenecked by the size of the hard disk, whereas data sets analyzed out-of-the-box in R are bottlenecked by the size of the RAM – more on this in a minute.  Both are physical hardware limitations of the machine and not limitations in the software.  However, hard drive capacities have historically grown faster than support for RAM, giving R the bad rap of being limited.  Two key developments make this topic moot.  With the appearance of 64-bit operating systems that support far more RAM, and with R connectivity to databases, R can be made to support any size data set that SAS can support, including those that contain billions of observations.

Our organization has used R to analyze and model datasets that contain millions of rows and scores of variables, and take up gigabytes of hard drive space, without issue.
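One way to make the RAM bottleneck moot in practice is to keep the data in a database and pull it into R in chunks.  Below is a hedged sketch using the DBI and RSQLite packages (assumed to be installed); the table here is a small stand-in for a much larger one, and only one chunk of rows is ever held in memory:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "obs", data.frame(x = 1:100))  # stand-in for a huge table

# Stream the result set in chunks rather than loading it all at once.
res <- dbSendQuery(con, "SELECT x FROM obs")
total <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 25)    # fetch 25 rows at a time
  total <- total + sum(chunk$x)    # aggregate chunk-by-chunk
}
dbClearResult(res)
dbDisconnect(con)

print(total)
```

The same pattern scales to billions of rows because RAM usage depends on the chunk size, not the table size.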

2. Open-source versus Closed

R is open-source while SAS is proprietary and closed.  The SAS marketing engine has historically argued that open-source technology should not be trusted and carries additional inherent risks.  In fact, a 2009 New York Times article by Ashlee Vance entitled, “Data Analysts Captivated by R’s Power,” quoted Anne H. Milley, director of technology product marketing at SAS as saying, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

Having used both technologies for some time, in my opinion Ms. Milley's view of R is misinformed, and much of the field agrees.  As SAS/R expert Steve Miller said of R at the Information Management blog, "I've never worked with a more stable, bug-free piece of software."  Similar to others who have worked with both SAS and R, my general experience with R is that it is extremely stable.

3. Corporate Usage

It has been argued that SAS is for businesses and R is for academics.  However, the evidence tells a different story.  In fact, some of the most pioneering companies of our time use R as part of their daily operations.  Google, Facebook, and Pfizer are just a few of the names who actively publicize their use of R.

Our company is particularly knowledgeable in this area, as we’ve helped companies make the conversion from SAS to R; this movement is not unique.  Indeed, major companies across all industries are turning to R as their lead statistical computing technology.

4. Academic Usage

While SAS was once a critical component of many (if not most) graduate-level statistics programs, SAS’s position is slowly being usurped by R at many top-level institutions.  I recall my days at Cornell and the chair of the mathematics department once telling me that no student would be in want of a job if he/she knew SAS.  While that was true at the time (and still is to some extent today), R is expanding in this area as more universities structure their programs with R at their core.

5. Lag Time for Implementation of New Methods

SAS proponents have argued that it takes time to vet and test new methods before implementing them into corporate software.  In fact, the marketing engines of several proprietary analytical vendors have stated this as a point of pride and competitive advantage for their technology.

Frankly, I don’t see how lag time can be an advantage in the field of statistical computing.  Particularly in finance, healthcare and other competitive fields, I’d much rather have a vehicle to access both the latest and the tried-and-true techniques than be left in the dust to inhale my competitor’s fumes.  This is probably why so many companies are switching or considering a conversion to R in the near future.

6. Cost

Because R is open-source, it is completely free.  Unlike with licensed proprietary statistical technologies, the ability to use your work won't be lost if you don't pay an annual renewal fee; R doesn't have any fees.  Shifting to R is perhaps one of the greatest risk-mitigation strategies a business could take in this arena, as once you have a copy of R, it is yours to keep for all eternity.  R is not going away.

7. Documentation

Both SAS and R are well documented.  SAS arguably has more documentation since it has been around longer.  However, a search of Amazon.com will reveal more R books than one would likely want to read.

There are many additional topics that could be discussed in a comparison between R and SAS.  However, this summary has highlighted some of the areas where there is conflict between reality and marketing spin.  Further discussion on this and similar topics is to follow.