An R user created a package in Python - you will not believe what happened next
I am a Data Scientist at a consulting firm in Sweden. I have a background in statistics and social science but have spent the last couple of years mainly trying to understand consumers and how they behave. I have worked with everything from making PowerPoint presentations to building recommender systems. In other words, I am a pretty typical data scientist.
I started using R a couple of years ago at my first job because they didn’t want to pay for an SPSS license and thought I could do most of my work in Excel. They were of course right, but just like anyone fresh from university I wanted to do cool stuff, and no one does cool stuff in Excel.
My relationship with R was complicated at the start. I learned all the basics through online courses but still struggled to do stuff that I could do in Excel with ease. However, after learning some powerful packages for manipulating data, mainly
data.table, both superb packages, I started to get addicted.
The more you program with R the more functions you write. At first I saved my functions in R scripts in project folders. When I needed to reuse functions I copied them between projects. However, I quickly abandoned this workflow in favor of creating my own R packages. I think this represents how a lot of novice R programmers and analysts work: we learn new stuff when we need to.
This last year I have created several of my own packages, some for personal and some for company-internal use. The trickiest part for me was understanding dependencies, that is, what to do when my package uses other packages, which packages generally do. All in all it took some days (and a lot of tutorial reading) to understand this, and there were times when I became very frustrated. But once I got it going it was relatively easy, and in the end it has saved me a lot of time compared to my previous workflow.
There are three packages in R that make creating an R package easy:
library(devtools)
library(roxygen2)
library(usethis)
So to create an R package and host it on GitHub I simply run:
create_package("~/Documents/ferrologic/example")
use_git()
use_github()
This creates the following package structure and adds a repository on my GitHub.
Now, I use R for analytics and statistics. I am interested in programming generally but it is not my biggest motivation for using R. I am not a software developer, and I don’t intend to become one. Therefore these packages help me a lot, because they allow me to create powerful and structured statistical software without manually setting everything up myself.
Nevertheless, it should be pointed out that the above example is only the skeleton of a package. To fill it with functions, the right dependencies and so on, has a learning curve. Fortunately it has become less steep in recent years, but it still isn’t something you usually do in your first R-week.
I haven’t had the chance or need to learn Python. Again, I mainly learn things thoroughly when I need to. However, I have recently been involved in a project that uses Python for things other than data science but would like to use the data science packages available in Python. This is a common scenario and I might write the analytics code in R nonetheless, because working between Python and R has become much easier. However, I saw this as a chance to learn a little bit about Python and especially how to build packages. (Semantics: strictly speaking, a Python module is a single .py file and a package is a directory of modules, but I use the two terms loosely here.)
The project I am involved in is working with District Heating and my first task was to build an API package in Python to get heat consumption data from an API to a pandas DataFrame.
I have worked a lot with APIs in R, so I started by making the API calls in R. After that I moved to Python. In Python I used the package requests and in R I used httr (which I can tell has been influenced by requests). I was struck by how similar the code was. Sure, there are some syntax differences but they were smaller than I anticipated.
Then I created several functions in R, such as get_weather_readings(), saved them in a package and put it on a Git portal for others in the project to install.
Now I wanted to do the same with Python:
- Create a package (module)
- Host it on a git portal
So for a new user it should look like this:
pip install git+https://git.pynergy.git

import pynergy as pyn
df = pyn.get_meter_reading(token, limit, metering_station)
R users are spoiled with the RStudio IDE. It allows us to write notebooks and scripts and create packages all in the same IDE. There are a number of Python IDEs but there is no clear winner. My data science colleagues who use Python usually do DS in a Jupyter Notebook. The notebook format is great for analysis, but suboptimal for building packages.
One of my developer colleagues recommended (somewhat reluctantly because of his hate of Windows) Visual Studio Code and its Python extension.
VS Code is not as good as RStudio. But it is a really great IDE.
I often start learning stuff through online tutorials. Usually I walk through the tutorial and then test the code on my own data. So I started by working through this tutorial by GitHub which, after a couple of hours, I realized was deprecated… It reminded me to always check the date of tutorials. In the field of open source data science things are evolving rapidly.
I headed to the Python website but this was a little much. It is as if I had started with R’s official extension site when building an R package. Both of these manuals are very thorough, but they are not written for beginners and not very helpful when you just want to try something.
So instead I stumbled on the Python Packaging website.
I began with this simple tutorial.
On the Python Packaging website it is stated that the structure of a Python package should be:
packagename/
    packagename/
        __init__.py
    setup.py
I had to create this structure by hand and I don’t know if there is any other way (do tell if there is).
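For what it is worth, the structure can also be scripted instead of clicked together. Below is a minimal sketch that uses only the standard library. The helper name create_package_skeleton is my own invention for illustration, not something from the Python Packaging guide, and it only creates empty files.

```python
import tempfile
from pathlib import Path


def create_package_skeleton(root, name):
    """Create <name>/<name>/__init__.py and <name>/setup.py below root.

    Hypothetical helper: it only creates empty files, nothing more.
    """
    project_dir = Path(root) / name
    source_dir = project_dir / name
    source_dir.mkdir(parents=True, exist_ok=True)
    created = [source_dir / "__init__.py", project_dir / "setup.py"]
    for path in created:
        path.touch(exist_ok=True)
    return created


# Try it out in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    files = create_package_skeleton(tmp, "packagename")
    relative = sorted(p.relative_to(tmp).as_posix() for p in files)

print(relative)
```

The files are empty, so setup.py and __init__.py still have to be filled in by hand as described below.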
__init__.py is where you import your function from a Python script, similar to the NAMESPACE in R.
To create the package you use setup.py, where you use the package setuptools like this:
from setuptools import setup

setup(name='pynergy',
      version='0.1',
      description='Easy access to Smart Energi API',
      url='https://git.smartenergi.org/FilipWastberg/pynergy',
      author='Filip Wastberg',
      author_email='email@example.com',
      license='MIT',
      packages=['pynergy'],
      zip_safe=False)
After this I could run pip install . to install the package.
However, because I rely on three other packages, pandas, requests and json, I couldn’t run the function unless I had these loaded. Furthermore, I could not run the function at all, because of what I guess were dependency issues.
import pynergy
import pandas as pd
import json
import requests

pynergy.get_meter_readings(token, include_from, include_to, metering_point_id, limit)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
 in
----> 1 pynergy.get_meter_readings(token, include_from, include_to, metering_point_id, limit)

~/Documents/ferrologic/pynergy/pynergy/__init__.py in get_meter_readings(token, include_from, include_to, metering_point_id, limit)
     13     }
     14
---> 15     r = requests.get(url, headsers = headers, params = parameters, verify=False)
     16
     17     data = json.loads(r.text)

NameError: name 'requests' is not defined
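As far as I understand it now, the NameError comes from the fact that every Python module has its own namespace: importing requests in my session does not make it visible inside the package's code. The sketch below reproduces the effect with json and a made-up module called fake_package, which I build in memory purely for illustration.

```python
import json  # imported by the caller, just like in my session above
import types

# Source code of a hypothetical module that uses json without importing it.
source = '''
def parse(text):
    return json.loads(text)
'''

fake_package = types.ModuleType("fake_package")
exec(source, fake_package.__dict__)

try:
    fake_package.parse('{"a": 1}')
    message = None
except NameError as error:
    # The function looks up json in its own module, not in the caller's.
    message = str(error)

print(message)

# Putting the import inside the module's own source resolves it.
fixed_package = types.ModuleType("fixed_package")
exec("import json\n" + source, fixed_package.__dict__)
result = fixed_package.parse('{"a": 1}')

print(result)
```

So the imports have to live in the file where the function is defined, not in the script that calls it.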
Before I moved on to resolving these dependency issues I used this piece of code to create a distribution package:
$ python setup.py sdist
At this point I made my first commit and pushed it to my repository.
__init__.py looked like this:
from .get_functions import get_meter_readings
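To see what that single line in __init__.py does, here is a small self-contained sketch. It writes a throwaway package to a temporary directory and imports it. The package name demo_pkg and the dummy function body are assumptions for illustration only; the real function of course talks to the API.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "demo_pkg"  # hypothetical package name
    package_dir.mkdir()

    # The function lives in its own file...
    (package_dir / "get_functions.py").write_text(
        "def get_meter_readings():\n    return 'readings'\n"
    )
    # ...and __init__.py lifts it to the top level of the package.
    (package_dir / "__init__.py").write_text(
        "from .get_functions import get_meter_readings\n"
    )

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        demo_pkg = importlib.import_module("demo_pkg")
        value = demo_pkg.get_meter_readings()
        defined_in = demo_pkg.get_meter_readings.__module__
    finally:
        sys.path.remove(tmp)

print(value)
print(defined_in)
```

Thanks to the relative import, users can call demo_pkg.get_meter_readings() directly without knowing which file the function sits in.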
Now I wanted to test whether I could install this package from my Git repository:
pip install git+https://git.smartenergi.org/FilipWastberg/pynergy.git
This worked fine, although the dependency issues (of course) remained.
To specify dependencies you have to specify them in two places:
- In the source-file
- In the setup
So in the source file where I saved my function, get_functions.py, I specified my dependencies like this:
from json import loads
from pandas.io.json import json_normalize
from requests import get
from pandas import to_datetime
from pandas import DataFrame

def get_meter_readings(token, include_from, include_to, metering_point_id, limit):
    df = 'code to retrieve data from API'
    return df
And in the setup file I just specified them in install_requires:
from setuptools import setup

setup(name='pynergy',
      version='0.1',
      description='Easy access to Smart Energi API',
      url='https://git.smartenergi.org/FilipWastberg/pynergy',
      author='Filip Wastberg',
      author_email='firstname.lastname@example.org',
      license='MIT',
      packages=['pynergy'],
      install_requires=[
          'pandas',
          'json',
          'requests',
      ],
      zip_safe=False,)
In order to check that this had worked I ran:
$ python setup.py develop
This yielded some permission issues that to this day I can’t figure out how to solve (working on Mac).
The following error occurred while trying to add or remove files in the installation directory:
[Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-49833.pth'
However, I was able to resolve this by instead writing (don’t ask me about details here):
$ python setup.py develop --user
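My best understanding of why this helps: the system-wide site-packages directory often belongs to an administrator account, while --user installs into a per-user directory that I am allowed to write to. The sketch below prints both locations for whichever Python runs it and checks whether they are writable. The helper is_writable is my own, and the paths will differ from machine to machine.

```python
import os
import site
import sysconfig

# Where packages land by default, and where they land with --user.
system_site = sysconfig.get_paths()["purelib"]
user_site = site.getusersitepackages()


def is_writable(path):
    """Check write access on path, or on its nearest existing parent."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent
    return os.access(path, os.W_OK)


print("system:", system_site, "writable:", is_writable(system_site))
print("user:  ", user_site, "writable:", is_writable(user_site))
```

If the first line says the system directory is not writable, that matches the Permission denied error above.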
Here I stumbled on my first difference between Python 2 and Python 3.
Searching for json
Reading https://pypi.python.org/simple/json/
Couldn't find index page for 'json' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for json
error: Could not find suitable distribution for Requirement.parse('json')
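This is apparently because json is included in Python 3 as part of the standard library, so it is not a separate package that can be downloaded. So I had to remove json from install_requires, because it already ships with Python.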
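A quick way to avoid this mistake is to check which names belong to the standard library before listing them as requirements. The sketch below assumes Python 3.10 or later, where sys.stdlib_module_names is available. On older versions it falls back to a tiny hand-written set that I made up for this example. The function name split_requirements is also my own.

```python
import sys

# sys.stdlib_module_names was added in Python 3.10. The fallback set is an
# assumption for older versions and is far from complete.
STDLIB_NAMES = getattr(sys, "stdlib_module_names", frozenset({"json", "os", "sys"}))


def split_requirements(names):
    """Separate names that must be installed from names bundled with Python."""
    to_install = [name for name in names if name not in STDLIB_NAMES]
    bundled = [name for name in names if name in STDLIB_NAMES]
    return to_install, bundled


to_install, bundled = split_requirements(["pandas", "json", "requests"])

print("install_requires:", to_install)
print("already bundled: ", bundled)
```

Only the first list belongs in install_requires. Note that this compares import names, which happen to match the names used for installation for these three packages but do not always.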
When running the code again a series of installations started. This took around half an hour.
When it was done I tried to install the package again from Git. This time I was able to successfully install the package and run the function (yeeeey).
It’s easy for me to say that I prefer R for creating packages. I know the R ecosystem fairly well. However, and correct me if I’m wrong, I do think that it is more common to create your own packages to abstract code in R than it is in Python. But this is strictly based on my own observations of Python code. There are numerous helpful packages in R that make creating a package really easy and I struggled to find any equivalents in Python (if there are packages in Python similar to roxygen2, do tell). The alternative to creating packages is often to copy and paste code between different projects. The simplicity of creating R packages makes us use this strategy less, which I think is great for data science.
With that said, I think that the dependency handling in Python is somewhat more straightforward than in R. One thing that I really like about Python is the ability to switch between environments. You can create a new environment that is “clean” and try things out. I also think that the structure of exporting and importing functions is easier when creating a Python package.
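As a small illustration of the clean-environment idea, the standard library module venv can create one from Python itself. This is a minimal sketch: the directory name clean-env is arbitrary, and I skip installing pip into the environment to keep it fast.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "clean-env"  # arbitrary name for the example

    # Create an isolated environment without any third-party packages.
    venv.create(env_dir, with_pip=False)

    # Every environment created by venv gets a pyvenv.cfg marker file.
    has_config = (env_dir / "pyvenv.cfg").is_file()

print(has_config)
```

On the command line the equivalent is python -m venv clean-env, after which you activate the environment and install only what the experiment needs.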