Creating Python modules like R packages

An R user created a package in Python - you will not believe what happened next

I am a Data Scientist at a consulting firm in Sweden. I have a background in statistics and social science but have spent the last couple of years mainly trying to understand consumers and how they behave. I have worked with everything from making PowerPoint presentations to building recommender systems. In other words, I am a pretty typical data scientist.

I started using R a couple of years ago at my first job because they didn’t want to pay for an SPSS license and thought I could do most of my work in Excel. They were of course right, but just like anyone fresh from university I wanted to do cool stuff, and no one does cool stuff in Excel.

My relationship with R was complicated at the start. I learned all the basics through online courses but still struggled to do stuff that I could do in Excel with ease. However, after learning some powerful packages for data manipulation, mainly dplyr and data.table, both superb packages, I started to get addicted.

The more you program with R the more functions you write. At first I saved my functions in R scripts in project folders. When I needed to reuse functions I copied them between projects. However, I quickly abandoned this workflow in favor of creating my own R packages. I think this represents how a lot of novice R programmers and analysts work: we learn new stuff when we need to.

This last year I have created several of my own packages, some for personal and some for company-internal use. The trickiest part for me was to understand dependencies in packages, that is, what to do when my package uses other packages, which it generally does. All in all it took some days to understand this (and a lot of tutorial reading), and there were times when I became very frustrated. But once I got it going it was relatively easy, and in the end it has saved me a lot of time compared to my previous workflow.

There are three packages in R that make creating an R package easy:

library(devtools)
library(roxygen2)
library(usethis)

So to create an R package and host it on GitHub I simply run:

create_package("~/Documents/ferrologic/example")
use_git()
use_github()

This creates the package skeleton and adds a repository on my GitHub.

Now, I use R for analytics and statistics. I am interested in programming generally but it is not my biggest motivation for using R. I am not a software developer, and I don’t intend to become one. Therefore these packages help me a lot, because they allow me to create powerful and structured statistical software without manually setting everything up myself.

Nevertheless, it should be pointed out that the above example is only the skeleton of a package. Filling it with functions, the right dependencies and so on has a learning curve. Fortunately it has become less steep in recent years, but it still isn’t something you usually do in your first week of R.

Python

I haven’t had the chance or need to learn Python. Again, I mainly learn things thoroughly when I need to. However, I have recently been involved in a project that uses Python for things other than data science but would like to use the data science packages available in Python. This is a common scenario, and I might write the analytics code in R nonetheless, because working between Python and R has become much easier. However, I saw this as a chance to learn a little bit about Python and especially how to build packages. (Semantics: in Python a single .py file is a module and a folder of modules is a package; I use "package" loosely for both.)

The project I am involved in is working with District Heating and my first task was to build an API package in Python to get heat consumption data from an API to a pandas DataFrame.

I have worked a lot with APIs in R, so I started by making the API calls in R. After that I moved to Python. In Python I used the package requests and in R I used httr (which I can tell has been influenced by requests). I was struck by how similar the code was. Sure, there are some syntax differences, but they were smaller than I anticipated.

Then I created several functions in R, such as get_meter_readings() and get_weather_readings(), saved them in a package and put it on a Git portal for others in the project to install.

Now I wanted to do the same with Python:

  • Create a package(module)
  • Host it on a git portal

So for a new user it should look like this:

pip install git+https://git.pynergy.git
import pynergy as pyn

df = pyn.get_meter_reading(token, limit, metering_station)

The IDE

R users are spoiled with the RStudio IDE. It allows us to write notebooks and scripts and create packages, all in the same IDE. There are a number of Python IDEs but there is no clear winner. My data science colleagues who use Python usually do DS in a Jupyter Notebook. The notebook format is great for analysis, but suboptimal for building packages.

One of my developer colleagues recommended (somewhat reluctantly because of his hate of Windows) Visual Studio Code and its Python extension.

VS Code is not as good as RStudio. But it is a really great IDE.

The tutorials

I often start learning stuff through online tutorials. Usually I walk through the tutorial and then test the code on my own data. So I started by working through this tutorial by GitHub, which after a couple of hours I realized was deprecated… It reminded me to always check the date of tutorials. In the field of open source data science things are evolving rapidly.

I headed to the Python website, but this was a little much. It was like starting with R’s official extension site when building an R package. Both of these manuals are very thorough, but they are not written for beginners and are not very helpful when you just want to try something.

So instead I stumbled on the Python Packaging website.

I began with this simple tutorial.

Python packages

On the Python Packaging website it is stated that the structure of a Python package should be:

packagename/
    packagename/
        __init__.py
    setup.py

I had to create this structure by hand and I don’t know if there is any other way (do tell if there is).
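One way to avoid creating the folders by hand is a few lines of Python. This is only a minimal sketch using the standard library; the function name scaffold_package is made up for illustration, and the setup.py it writes is a bare placeholder rather than a complete configuration.

```python
from pathlib import Path


def scaffold_package(root, name):
    """Create the minimal layout: root/name/setup.py and root/name/name/__init__.py."""
    project = Path(root) / name
    source = project / name
    source.mkdir(parents=True, exist_ok=True)

    # An empty __init__.py is enough to mark the inner folder as a package.
    (source / "__init__.py").touch()

    # Only write setup.py if it is missing, so existing work is not overwritten.
    setup_py = project / "setup.py"
    if not setup_py.exists():
        setup_py.write_text(
            "from setuptools import setup\n\n"
            f"setup(name={name!r}, version='0.1', packages=[{name!r}])\n"
        )
    return project
```

Calling scaffold_package(".", "packagename") would produce the structure shown above in the current folder.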

__init__.py is where you import your function from a Python script, similar to the NAMESPACE in R.

To create the package you write a setup.py that uses the setuptools package, like this:

from setuptools import setup

setup(name='pynergy',
      version='0.1',
      description='Easy access to Smart Energi API',
      url='https://git.smartenergi.org/FilipWastberg/pynergy',
      author='Filip Wastberg',
      author_email='[email protected]',
      license='MIT',
      packages=['pynergy'],
      zip_safe=False)

After this I could run pip install . to install the package.

However, my function relies on three other packages (pandas, requests and json), so I expected to need them loaded before calling it. As it turned out, I could not run the function at all, even with all three imported, because of what I guessed were dependency issues.

import pynergy
import pandas as pd
import json
import requests

pynergy.get_meter_readings(token, include_from, include_to, metering_point_id, limit)

Which yielded:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
 in 
----> 1 pynergy.get_meter_readings(token, include_from, include_to, metering_point_id, limit)

~/Documents/ferrologic/pynergy/pynergy/__init__.py in get_meter_readings(token, include_from, include_to, metering_point_id, limit)
     13        }
     14 
---> 15     r = requests.get(url, headsers = headers, params = parameters, verify=False)
     16 
     17     data = json.loads(r.text)

NameError: name 'requests' is not defined
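In hindsight the traceback makes sense: every Python module has its own namespace, so an import in the notebook only defines the name in the notebook, not inside pynergy. The sketch below imitates this with a throwaway in-memory module; the names fake_pynergy and parse are invented for the illustration and are not part of the real package.

```python
import json
import types

# Build a throwaway module in memory; it stands in for a package file
# that uses json without importing it.
broken = types.ModuleType("fake_pynergy")
exec("def parse(text):\n    return json.loads(text)\n", broken.__dict__)

try:
    broken.parse('{"a": 1}')
    outcome = "worked"
except NameError as err:
    # json is imported in *this* file, but not inside the module.
    outcome = str(err)

# The fix: the import has to live in the module's own namespace.
fixed = types.ModuleType("fake_pynergy_fixed")
exec(
    "import json\n\ndef parse(text):\n    return json.loads(text)\n",
    fixed.__dict__,
)
result = fixed.parse('{"a": 1}')
```

The first call fails with the same kind of NameError as above even though json is imported at the top of the script, while the second module works because it does its own import.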

Before I moved on to resolving these dependency issues I used this piece of code to create a distribution package:

$ python setup.py sdist

At this point I made my first commit and pushed it to my repository.

__init__.py looked like this:

from .get_functions import get_meter_readings
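To see what that single line in __init__.py does, here is a small self-contained sketch. It writes a temporary package called demo_pkg (a made-up name) with the same two-file layout and shows that the function becomes reachable directly from the package, much like an exported function in R.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()

    # The file that holds the actual function.
    (pkg / "get_functions.py").write_text(
        "def get_meter_readings():\n    return 'readings'\n"
    )
    # The re-export: lift the function up to the package level.
    (pkg / "__init__.py").write_text(
        "from .get_functions import get_meter_readings\n"
    )

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        demo_pkg = importlib.import_module("demo_pkg")
        # Short form, available thanks to __init__.py.
        via_package = demo_pkg.get_meter_readings()
        # Long form, which always works.
        via_submodule = demo_pkg.get_functions.get_meter_readings()
    finally:
        sys.path.remove(tmp)
```

Without the line in __init__.py, only the longer demo_pkg.get_functions.get_meter_readings() form would be available after an explicit import of the submodule.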

Now I wanted to check whether I could install this package from my Git repository:

pip install git+https://git.smartenergi.org/FilipWastberg/pynergy.git

This worked fine, although the dependency issues (of course) remained.

Dependencies

Dependencies have to be specified in two places:

  • In the source-file
  • In the setup

So in the source file where I saved my function, get_functions.py, I specified my dependencies like this:

from json import loads
from pandas.io.json import json_normalize
from requests import get
from pandas import to_datetime
from pandas import DataFrame

def get_meter_readings(token, include_from, include_to, metering_point_id, limit):
    # Placeholder standing in for the code that retrieves data from the API.
    df = 'code to retrieve data from API'

    return df

And in the setup-file I just specified them in install_requires=[].

from setuptools import setup

setup(name='pynergy',
      version='0.1',
      description='Easy access to Smart Energi API',
      url='https://git.smartenergi.org/FilipWastberg/pynergy',
      author='Filip Wastberg',
      author_email='[email protected]',
      license='MIT',
      packages=['pynergy'],
      install_requires=[
          'pandas',
          'json',
          'requests',
      ],
      zip_safe=False)

In order to check that this had worked I ran:

$ python setup.py develop

This yielded some permission issues that to this day I can’t figure out how to solve (working on Mac).

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-49833.pth'

However, I was able to resolve this by instead writing (don’t ask me about details here):

$ python setup.py develop --user
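A hedged note on the details: the path in the error message points at the system-wide Python 2.7 folder, which a normal user account is not allowed to write to, while --user installs into a per-user folder instead. The standard library can show both locations; this sketch only prints them and changes nothing.

```python
import site
import sys

# Which interpreter is actually running this code? Judging by the path in
# the error above, the plain "python" command was the system Python 2.7.
interpreter = sys.executable

# The per-user site-packages directory, which --user installs into.
user_site = site.getusersitepackages()

print("interpreter:", interpreter)
print("user site-packages:", user_site)
```

If the interpreter printed here is not the one you meant to use, calling python3 (or the interpreter of your environment) explicitly is probably the first thing to try.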

Here I stumbled on the next error:

Searching for json
Reading https://pypi.python.org/simple/json/
Couldn't find index page for 'json' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for json
error: Could not find suitable distribution for Requirement.parse('json')

This is because json ships with Python itself as part of the standard library, so there is no json package for setuptools to download. I had to remove json from install_requires.
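A quick way to catch this kind of mistake is to compare the dependency list against the names of the standard library modules. The sketch below assumes bare package names without version specifiers; the helper drop_stdlib is made up for illustration, and sys.stdlib_module_names only exists from Python 3.10, so a tiny hand-written fallback is used on older versions.

```python
import sys

# sys.stdlib_module_names exists from Python 3.10; the fallback set is a
# tiny hand-written sample for older interpreters, not a complete list.
STDLIB = getattr(sys, "stdlib_module_names", frozenset({"json", "os", "sys"}))


def drop_stdlib(requirements):
    """Return the requirements that are not standard-library modules."""
    return [name for name in requirements if name not in STDLIB]


cleaned = drop_stdlib(["pandas", "json", "requests"])
print(cleaned)
```

Only the names left in cleaned belong in install_requires; the standard library modules still need their import statements in the source file.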

When running the code again a series of installations started. This took around half an hour.

When it was done I tried to install the package again from Git. This time I was able to successfully install the package and run the function (yeeeey).

Conclusion

It’s easy for me to say that I prefer R for creating packages. I know the R ecosystem fairly well. However, and correct me if I’m wrong, I do think that it is more common to create your own packages to abstract code in R than it is in Python. But this is strictly based on my own observations of Python code. There are numerous helpful packages in R that make creating a package really easy, and I struggled to find any equivalents in Python (if there are packages in Python similar to usethis and roxygen2, do tell). The alternative to creating packages is often to copy and paste code between different projects. The simplicity of creating R packages makes us use this strategy less, which I think is great for data science.

With that said, I think that the dependency handling in Python is somewhat more straightforward than in R. One thing that I really like about Python is the ability to switch between environments. You can create a new environment that is “clean” and try things out. I also think that the structure of exporting and importing functions is easier when creating a Python package.
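For the curious, a “clean” environment can be created from Python itself with the standard venv module. This is a minimal sketch; the folder name clean-env is arbitrary, and the environment is created in a temporary folder that is deleted again at the end.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "clean-env"

    # with_pip=False keeps the sketch fast; pass with_pip=True if you want
    # to install packages into the environment afterwards.
    venv.create(env_dir, with_pip=False)

    # Every environment created by venv gets a pyvenv.cfg file at its root.
    has_config = (env_dir / "pyvenv.cfg").is_file()
```

In day-to-day work the same thing is usually done from the terminal with python -m venv followed by a folder name.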
