How To Get Free Intraday Options Data With Pandas-DataReader

Post Outline

  • Purpose
  • Intuitive Explanation
  • Code
  • Next Steps

Purpose

This is a simple reference article for readers who might wonder where I get my options data. In that regard, I would like to shout out the contributors to pandas-datareader; without their efforts this process would be much more complex.

Intuitive Explanation

This code consists of three components. The first is the actual script that wraps the pandas-datareader functions and downloads the options data. The second is a helper script, which I call file_handler, that saves the aggregated data to disk in multiple formats within a structured file directory. Internally it checks whether today's folder exists, using a particular date-based naming convention; if it doesn't, it creates the folder and stores all the data files there. The third component is what gives this code the ability to aggregate intraday data: your system's task scheduler. For example, on Linux/Ubuntu you can easily run this script as a cronjob. After the code below I show an example cronjob template that works.

Code


import sys
import os
import time
PROJECT_DIR = '/YOUR/CODE/DIR/option_skew_project/'
sys.path.append(PROJECT_DIR)

from pandas_datareader.data import Options
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
import numpy as np

from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

from file_handler import file_handler

# -----------------------------------------------------------------------------
# import symbols
# -----------------------------------------------------------------------------
symbols = (pd.read_csv(PROJECT_DIR+'data/symbols.csv', header=None, index_col=False)
             .rename(columns={0: 'symbols'}))

# ------------------------------------------------------------------------------
# define conv. fn.
# ------------------------------------------------------------------------------
def cprint(df):
    print('-'*50)
    print(df.sample(5))
    print()
    print(df.info())
    print()

def random_wait():
    """fn: randomly choose a wait time based on given probabilities"""
    wait_times = [0.2, 0.5, 1, 2]
    probs = [0.3, 0.4, 0.2, 0.1]
    # without size=, np.random.choice returns a scalar, which time.sleep accepts
    return np.random.choice(wait_times, p=probs)

# ------------------------------------------------------------------------------
# init file handler
# ------------------------------------------------------------------------------

fh = file_handler(PROJECT_DIR)

# ------------------------------------------------------------------------------
# run aggregation func
# ------------------------------------------------------------------------------
errors = []
dfs_dict = {}

for sym in tqdm(symbols.symbols.values):
    print('-'*50)
    print('downloading {} ...'.format(sym))
    try:
        tmp_df = Options(sym, 'yahoo').get_all_data()
        dfs_dict[sym] = tmp_df
    except Exception as e:
        errors.append(sym)
        print('{} error: {}'.format(sym, e))
        continue
    else:
        print('{} complete'.format(sym))
        print()
        time.sleep(random_wait())

# ------------------------------------------------------------------------------
# concat dfs drop unnecessary columns
# ------------------------------------------------------------------------------
data = (pd.concat(list(dfs_dict.values())).drop(['JSON'], axis=1))
error_series = pd.Series(errors)

cprint(data)
print(error_series)

# ------------------------------------------------------------------------------
# save data
# ------------------------------------------------------------------------------

fh.save_data(error_series, format='csv', resolution='date', errors=True)
try:
    fh.save_data(data, format='parquet')
except Exception as e:
    print(e)

fh.save_data(data, format='h5')
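
After a successful run, the file_handler (shown below) lays the files out roughly like this; the folder date and timestamps here are illustrative:

option_skew_project/
    Yahoo_Options_Data/
        2018-01-05/
            yahoo_options_data_2018-01-05_10.30.00.123456-05.00.parquet
            yahoo_options_data_2018-01-05_10.30.00.123456-05.00.h5
            yahoo_options_scraper_errors_2018-01-05.csv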

This is the code for the file_handler script. It can save data in any of four formats: parquet, h5, feather, or csv. I save the list of symbol errors as a CSV since that list is generally quite small. As seen above, I save the options data in parquet format first, with a backup in the form of an h5 file. Generally I prefer to work with parquet files because they are compressed by default, contain metadata, and integrate better with Dask. This code requires the pyarrow package.


import os
import pandas as pd
import numpy as np

import pyarrow as pa
import pyarrow.parquet as pq

class file_handler:
    '''
    class for handling directory/folder creation + data saving

    Attributes
        project_dir : str(), main project directory

    Methods
        save_data : actual public save function
            |__> _create_dir : internal fn to create daily dir if it does not exist
                |__> __check_exists_or_create : private fn to check if dir exists, create it if not
            |__> __create_date_str : private fn to create date str
            |__> __create_timestamp_str : private fn to create timestamp str

    '''
    def __init__(self, project_dir):
        self.project_dir = project_dir

    def __check_exists_or_create(self, _dir):
        """fn: check if dir exists, create it (and any missing parents) if not"""
        if not os.path.exists(_dir):
            try:
                # makedirs also creates the parent Yahoo_Options_Data folder on first run
                os.makedirs(_dir)
            except Exception as e:
                print(e)
        return

    def _create_dir(self):
        """fn: create daily directory if not already created"""
        _dir = self.project_dir+'/Yahoo_Options_Data/'+str(pd.to_datetime('now').date())+'/'

        self.__check_exists_or_create(_dir)
        return _dir

    def __create_timestamp_str(self):
        """fn: to create time stamp str"""
        return str(pd.to_datetime('now').tz_localize('utc').tz_convert('US/Eastern')).replace(' ', '_').replace(':','.')

    def __create_date_str(self):
        """fn: to create date str"""
        return str(pd.to_datetime('now').date())

    def save_data(self, data, format='parquet', resolution='time', errors=False):
        """fn: to save data to directory

        Args
            data : pd.DataFrame
            format : str, ('parquet', 'h5', 'csv', 'feather')
            resolution : str, date or time
                if date uses default str format,
                if time will use YYYY-MM-DD_HH.MM.SS
            errors : bool,
                if True change filepath name
                if False use options data filepath name
        """
        _dir = self._create_dir()

        if resolution == 'time':
            _timestamp = self.__create_timestamp_str()
        elif resolution == 'date':
            _timestamp = self.__create_date_str()
        else:
            raise ValueError("resolution must be 'time' or 'date'")

        if errors:
            _fp = _dir + f'yahoo_options_scraper_errors_{_timestamp}.{format}'
        else:
            _fp = _dir + f'yahoo_options_data_{_timestamp}.{format}'

        if format == 'parquet':
            _table = pa.Table.from_pandas(data)
            pq.write_table(_table, _fp)
        elif format == 'h5':
            data.to_hdf(_fp, key='data')
        elif format == 'csv':
            data.to_csv(_fp, index=False)
        elif format == 'feather':
            data.to_feather(_fp)
        else:
            raise ValueError("format must be one of: parquet, h5, csv, feather")
        return
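
To verify a save, here is a minimal read-back sketch; the file path below is hypothetical, so substitute one produced by an actual run:

import pyarrow.parquet as pq

# hypothetical path following the naming convention above
fp = '/YOUR/CODE/DIR/option_skew_project/Yahoo_Options_Data/2018-01-05/yahoo_options_data_2018-01-05_10.30.00.123456-05.00.parquet'
df = pq.read_table(fp).to_pandas()
print(df.shape)
print(df.head())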

Finally, below is an example of my cronjob. It is set to run hourly, Monday through Friday, from market open to close. Note the log directory and log file after the ">>"; all of the script's print statements, including any exceptions, will be written to that log file.

 30 7-15 * * 1-5 /YOUR/CODE/DIR/option_skew_project/scripts/options_downloader.py >> /YOUR/LOG/DIR/options_downloader_cronlog.log 2>&1
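
For cron to execute the script directly like this, the file needs execute permission (chmod +x) and an interpreter line at the top; otherwise, prefix the script path in the crontab entry with your Python interpreter. A minimal first line, assuming a standard environment:

#!/usr/bin/env python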

Next Steps

The next article will document the code I refactored to calculate the option skew metric from the paper "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang, and Rui Zhao. If you have been a longtime reader, you may recall I did a series in which I tracked a theoretical ETF equity strategy based on this metric. Over time, people have asked how it is performing, and I did not have an answer because I stopped tracking it while busy with other projects. However, the strategy showed promise, and I wondered whether it could be applied directly to options trading. My goal is to research the possibility of implementing this strategy live and, if the results show an edge, to implement it and track the results publicly.

To accomplish this task I first needed to gather data, which is what this article shows. In the next article I make heavy use of Dask because the volume of intraday data aggregated over a month exceeds 14 million rows, and operating on the dataframe in memory is slow and/or infeasible on most people's systems, including mine.
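As a preview, here is a minimal sketch of loading a month of the daily parquet files with Dask; the glob pattern assumes the directory layout produced by the file_handler above:

import dask.dataframe as dd

# one folder per day, one parquet file per intraday snapshot
ddf = dd.read_parquet('/YOUR/CODE/DIR/option_skew_project/Yahoo_Options_Data/*/yahoo_options_data_*.parquet')

print(ddf.npartitions)
# Dask is lazy; len() triggers the computation across all partitions
print(len(ddf))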

Additionally, the next article will be a Jupyter notebook that I will embed here directly as a blog post, though I recommend viewing it on the GitHub repo, which I will make public.