{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Model Data Compiler\n", "---\n", "\n", "Download all the Jupyter notebooks from: https://github.com/HeloiseS/hoki/tree/master/tutorials\n", "\n", "# Initial Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from hoki.constants import MODELS_PATH, OUTPUTS_PATH, DEFAULT_BPASS_VERSION\n", "from time import time\n", "from hoki.data_compilers import ModelDataCompiler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "The BPASS outputs contain a lot of information that is pretty much ready for analysis and can be quickly loaded through `hoki` (e.g. HR diagrams, star counts, etc.). \n", "\n", "But if you're looking to match your observations to specific stellar models present in BPASS, say a binary system with primary ZAMS 9 $M_{\\odot}$ and absolute F814w magnitude = X, then you will need to explore the large library of BPASS stellar models to get those evolution tracks.\n", "\n", "The best way to go about that is to compile all the relevant data (i.e. the IMFs and metallicities you want, for single or binary models) into one or multiple DataFrames that can then be saved into binary files.\n", "**Loading data from binary files is much faster** and it means you won't have to compile your data from text files multiple times. We'll go over searching through the DataFrames in the notebook called \"Model Search\". Here we focus on using the `ModelDataCompiler`.\n", "\n", "The class `ModelDataCompiler` is pretty much a pipeline: given the relevant parameters (which we'll see in a minute), it will locate the BPASS input files, read them, then fetch the BPASS stellar models one by one and combine all of this information in one single DataFrame. 
It can then be pickled using pre-existing `pandas` functionalities.\n", "\n", "Here is a visual summary:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Drawing\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the Data\n", "\n", "To run `ModelDataCompiler` you need two things: \n", "- An input file with a name like `input_bpass_z020_bin_imf135_300`, which is in the BPASS output files (folders with names like `bpass_v2.2.1_imf135_300`)\n", "- 50 GB of text files containing the BPASS stellar model library, parameters, etc. The most recent ones are in `bpass-v2.2-newmodels.tar.gz` in the [Google Drive](https://drive.google.com/drive/folders/1BS2w9hpdaJeul6-YtZum--F4gxWIPYXl).\n", "\n", "The input files are, in essence, the recipe, and the 50 GB of text files are the ingredients. We need to read the input file to know which stars to put in, in what quantity (IMF) and at what time (if there are mergers, etc.). That is all done by the pipeline ;)\n", "\n", "### 4 Easy Steps\n", "Then, follow the next steps to run your `ModelDataCompiler` pipeline:\n", "- 1) Create a list of desired metallicities\n", "- 2) Create a list of desired \"dummy\" array columns\n", "- 3) Ensure the paths to the \"model outputs\" (which contain the inputs) and to the BPASS stellar models are correct\n", "- 4) Run `ModelDataCompiler`\n", "\n", "# Running the pipeline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Step 1\n", "# List the metallicities that you want to see in your DataFrame (same format as all BPASS metallicities)\n", "metallicity_list=['z020']" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Step 2\n", "# List the columns you want - see the hoki documentation for the full list of valid column names\n", "cols =['age','M1','f814w']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"*************************************************\n", "******* YOUR DATA IS BEING COMPILED ******\n", "*************************************************\n", "\n", "\n", "This may take a while ;)\n", "Go get yourself a cup of tea, sit back and relax\n", "I'm working for you boo!\n", "\n", "NOTE: The progress bar doesn't move smoothly - it might accelerate or slow down - it's perfectly normal :D\n", " |███████████████████████████████████████████████████████████████████████████████████████████████████-| 99.99% \n", "\n", "\n", "*************************************************\n", "******* JOB DONE! HAPPY SCIENCING! ******\n", "*************************************************\n", "This took 6.3 minutes\n" ] } ], "source": [ "# Step 3\n", "# Use the ModelDataCompiler pipeline\n", "start=time()\n", "myfirstcompiler = ModelDataCompiler(z_list=metallicity_list, \n", " columns=cols, \n", " # Note: The following are default parameters written explicitly for the \n", " # purposes of the tutorial\n", " binary=True, single=False, \n", " models_path=MODELS_PATH, input_files_path=OUTPUTS_PATH, \n", " bpass_version=DEFAULT_BPASS_VERSION, verbose=True)\n", "\n", "print(f\"This took {round((time()-start)/60,2)} minutes\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `models_path`, `input_files_path`, `bpass_version`\n", "\n", "- `models_path` is the ABSOLUTE PATH to the top folder of the BPASS stellar models. \n", "\n", "- `input_files_path` is the ABSOLUTE PATH to the **folder** containing the input files (with names like `input_bpass_z020_bin_imf135_300`) - `ModelDataCompiler` will find the right input files based on the other parameter information you provided\n", "\n", "- `bpass_version` is a **str** that indicates which BPASS version your stellar models come from: valid options are `v221` and `v222`. Unless you **know** that you have `v222`, you're probably using `v221` and you can just use the `DEFAULT_BPASS_VERSION` (see below). 
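Because a full compile takes several minutes, it can be worth confirming that both locations exist before launching the pipeline. Here is a minimal sketch; the paths are placeholders and the `check_paths` helper is illustrative, not part of `hoki`:

```python
from pathlib import Path

# Placeholder paths -- point these at your own BPASS folders.
models_path = Path("/home/user/bpass-v2.2-newmodels/")
input_files_path = Path("/home/user/bpass_v2.2.1_imf135_300/")

def check_paths(*paths):
    """Return the paths that do not exist on disk."""
    return [str(p) for p in paths if not p.exists()]

missing = check_paths(models_path, input_files_path)
if missing:
    print("Missing folders:", missing)
```

If anything is reported missing, fix the paths before running `ModelDataCompiler` rather than after a failed multi-minute run.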
\n", "\n", "\n", "### `MODELS_PATH`, `OUTPUTS_PATH`, `DEFAULT_BPASS_VERSION`\n", "\n", "All of these are `hoki` constants (they are in capital letters to make them stand out) - here is what they do and how you can update them if you want:\n", "\n", "---\n", "\n", "**`MODELS_PATH`**\n", "This is the location of the top folder containing the BPASS stellar models (the orange folder in the cartoon above). Mine is set to:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/home/fste075/BPASS_hoki_dev/bpass-v2.2-newmodels/'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MODELS_PATH" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This location is held in `hoki.data.settings.yaml` and can be changed by calling:\n", "\n", "`hoki.constants.set_models_path(path=[absolute path to the models])`\n", "\n", "Note that if you do this you will have to reload your Jupyter notebook for it to work. Alternatively, just set the parameter `models_path` in `ModelDataCompiler` to the right path. \n", "\n", "---\n", "**`OUTPUTS_PATH`**\n", "\n", "Same concept, but this is the default absolute path to the BPASS outputs, which contain HRDs, stellar numbers, ionizing flux information, etc., **including the input files**. 
In my case I haven't moved the input files outside of the output folder, so that's why I'm using this default.\n", "\n", "Mine is set to:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/home/fste075/BPASS_hoki_dev/bpass_v2.2.1_imf135_300/'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "OUTPUTS_PATH" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This location is also held in `hoki.data.settings.yaml` and can be changed by calling:\n", "\n", "`hoki.constants.set_outputs_path(path=[absolute path to the outputs])`\n", "\n", "Note that if you do this you will have to reload your Jupyter notebook for it to work.\n", "\n", "---\n", "\n", "**`DEFAULT_BPASS_VERSION`**\n", "\n", "This is also found in the `settings.yaml` file and is (for now) set to `v221` by default. Unless you know that you have `v222`, don't touch it. If you do want to change it, just use `hoki.constants.set_default_bpass_version([vXYZ])`.\n", "\n", "---\n", "\n", "Anyway, back to the data...\n", "\n", "\n", "# Accessing the data\n", "\n", "That's easy! Just do:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageM1f814wfilenamesmodel_imftypesmixed_imfmixed_ageinitial_BHinitial_Pz
00.000000e+0065.00000-5.621502NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
11.635020e+0364.99532-5.585636NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
22.163294e+0364.99379-5.569154NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
32.611867e+0364.99247-5.552240NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
43.002335e+0364.99132-5.535085NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
....................................
26740109.833045e+100.4228015.582510NEWSINMODS/z020/sneplot-z020-0.643.7266300NaNNaN020
26740119.865972e+100.4228015.638750NEWSINMODS/z020/sneplot-z020-0.643.7266300NaNNaN020
26740129.899617e+100.4228015.696310NEWSINMODS/z020/sneplot-z020-0.643.7266300NaNNaN020
26740139.933917e+100.4228015.755130NEWSINMODS/z020/sneplot-z020-0.643.7266300NaNNaN020
26740149.968715e+100.4228015.815090NEWSINMODS/z020/sneplot-z020-0.643.7266300NaNNaN020
\n", "

2674015 rows × 11 columns

\n", "
" ], "text/plain": [ " age M1 f814w filenames \\\n", "0 0.000000e+00 65.00000 -5.621502 NEWSINMODS/z020/sneplot-z020-65 \n", "1 1.635020e+03 64.99532 -5.585636 NEWSINMODS/z020/sneplot-z020-65 \n", "2 2.163294e+03 64.99379 -5.569154 NEWSINMODS/z020/sneplot-z020-65 \n", "3 2.611867e+03 64.99247 -5.552240 NEWSINMODS/z020/sneplot-z020-65 \n", "4 3.002335e+03 64.99132 -5.535085 NEWSINMODS/z020/sneplot-z020-65 \n", "... ... ... ... ... \n", "2674010 9.833045e+10 0.42280 15.582510 NEWSINMODS/z020/sneplot-z020-0.6 \n", "2674011 9.865972e+10 0.42280 15.638750 NEWSINMODS/z020/sneplot-z020-0.6 \n", "2674012 9.899617e+10 0.42280 15.696310 NEWSINMODS/z020/sneplot-z020-0.6 \n", "2674013 9.933917e+10 0.42280 15.755130 NEWSINMODS/z020/sneplot-z020-0.6 \n", "2674014 9.968715e+10 0.42280 15.815090 NEWSINMODS/z020/sneplot-z020-0.6 \n", "\n", " model_imf types mixed_imf mixed_age initial_BH initial_P z \n", "0 0.0778658 -1 0 0 NaN NaN 020 \n", "1 0.0778658 -1 0 0 NaN NaN 020 \n", "2 0.0778658 -1 0 0 NaN NaN 020 \n", "3 0.0778658 -1 0 0 NaN NaN 020 \n", "4 0.0778658 -1 0 0 NaN NaN 020 \n", "... ... ... ... ... ... ... ... \n", "2674010 43.7266 3 0 0 NaN NaN 020 \n", "2674011 43.7266 3 0 0 NaN NaN 020 \n", "2674012 43.7266 3 0 0 NaN NaN 020 \n", "2674013 43.7266 3 0 0 NaN NaN 020 \n", "2674014 43.7266 3 0 0 NaN NaN 020 \n", "\n", "[2674015 rows x 11 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myfirstcompiler.data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This DataFrame weighs 0.24 GB\n" ] } ], "source": [ "print(f\"This DataFrame weighs {round(sum(myfirstcompiler.data.memory_usage())/1e9,2)} GB\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is a pretty large `pandas.DataFrame`. 
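If memory is a concern, you can crop the DataFrame down to the columns and rows you actually need before going any further. Here is a minimal sketch; it uses a tiny hand-made stand-in for `myfirstcompiler.data` (since compiling the real thing takes minutes), with column names taken from the tutorial:

```python
import pandas as pd

# Tiny stand-in for myfirstcompiler.data (the real one has ~2.7 million rows).
data = pd.DataFrame({
    "age": [0.0, 1635.02, 2163.294],
    "M1": [65.0, 64.99532, 64.99379],
    "f814w": [-5.621502, -5.585636, -5.569154],
    "z": ["020", "020", "020"],
})

# Keep only the columns needed for the search at hand...
subset = data[["age", "M1", "f814w"]]

# ...and only the rows in the mass range of interest (10 Msol is arbitrary here).
massive = subset[subset["M1"] > 10]

print(f"{data.memory_usage(deep=True).sum()} -> "
      f"{massive.memory_usage(deep=True).sum()} bytes")
```

Dropping the string-valued columns you don't need (like `filenames`) tends to free the most memory, since object columns are far heavier than numerical ones.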
If you want all possible columns in the BPASS models it will be about 2.5 GB (100 columns).\n", "\n", "You can definitely turn this into an astropy table if you want - personally, I'm going to keep it that way for now. \n", "\n", "### Saving your data so you don't have to do this again\n", "\n", "Okay, now that we have compiled our data, we don't want to have to do it again. I'm going to show you how you can easily turn it into a binary file that you can later **load in seconds!**\n", "\n", "To avoid creating a massive file I'm going to crop it:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "data=myfirstcompiler.data.iloc[:5]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageM1f814wfilenamesmodel_imftypesmixed_imfmixed_ageinitial_BHinitial_Pz
00.00065.00000-5.621502NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
11635.02064.99532-5.585636NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
22163.29464.99379-5.569154NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
32611.86764.99247-5.552240NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
43002.33564.99132-5.535085NEWSINMODS/z020/sneplot-z020-650.0778658-100NaNNaN020
\n", "
" ], "text/plain": [ " age M1 f814w filenames model_imf \\\n", "0 0.000 65.00000 -5.621502 NEWSINMODS/z020/sneplot-z020-65 0.0778658 \n", "1 1635.020 64.99532 -5.585636 NEWSINMODS/z020/sneplot-z020-65 0.0778658 \n", "2 2163.294 64.99379 -5.569154 NEWSINMODS/z020/sneplot-z020-65 0.0778658 \n", "3 2611.867 64.99247 -5.552240 NEWSINMODS/z020/sneplot-z020-65 0.0778658 \n", "4 3002.335 64.99132 -5.535085 NEWSINMODS/z020/sneplot-z020-65 0.0778658 \n", "\n", " types mixed_imf mixed_age initial_BH initial_P z \n", "0 -1 0 0 NaN NaN 020 \n", "1 -1 0 0 NaN NaN 020 \n", "2 -1 0 0 NaN NaN 020 \n", "3 -1 0 0 NaN NaN 020 \n", "4 -1 0 0 NaN NaN 020 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "data.to_pickle('./data/tuto_data.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Already Available DataFrames\n", "\n", "Making the DataFrames takes some time, and in most cases people will be using the same IMF (kroupa 1.325 with a maximum mass = 300 solar masses). So to skip the step of compiling the data I've released a data set of pre-compiled DataFrames.\n", "\n", "Each table contains 100 columns and corresponds to **all** the data for **one metallicity**, either for **binary** models **or** for **single** star models (13 metallicities $\\times$ 2 (binary and single) = 26). Each DataFrame is a maximum of 2.5GB (a lot less for single star models) - but the size quickly goes down when you crop unnecessary data for your particular search (see the \"Model Search\" tutorial).\n", "\n", "The data set has been released on Zenodo and you can get it **[here](https://zenodo.org/record/3905388#.XvKgdHUzbmE)**\n", "\n", "**The reason this saves time** is that you'll then have all the data to hand in DataFrames and won't need to directly search the text files anymore. 
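Loading a pickled DataFrame back is a one-liner with `pandas.read_pickle`. A minimal sketch of the round trip, using a small stand-in frame and a temporary file (with the real compiled data you would simply point `read_pickle` at the file you saved earlier):

```python
import os
import tempfile

import pandas as pd

# Round trip: save a DataFrame to a binary file, then load it back.
df = pd.DataFrame({"age": [0.0, 1635.02], "M1": [65.0, 64.99532]})

path = os.path.join(tempfile.gettempdir(), "tuto_data_demo.pkl")
df.to_pickle(path)
loaded = pd.read_pickle(path)

print(loaded.equals(df))  # True -- nothing is lost in the round trip
```

The whole round trip takes a fraction of a second even for large frames, which is exactly why compiling once and pickling pays off.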
If you need several metallicities, load their DataFrames one after the other (making sure to perform the biggest data cuts before you load too many DataFrames - see the \"Model Search\" notebook) and combine the matching data from each metallicity into one final table. \n", "Even **a 2.5 GB DataFrame in a binary file loads in a couple of seconds**. It takes me over 6 minutes to run `ModelDataCompiler` on binary models, and it will take you longer if you don't have an SSD. This is a **flat fee** - even if you only want a small number of columns - because it's not `pandas` taking computational time, it's **reading everything from text files**, and all of these files need to be opened and read no matter how many columns you pick. \n", "\n", "If you need another IMF or really don't want to have all 100 columns, you'll need to do it yourself using the method we just went over, but unless you have a very good reason to do that you don't need to bother. \n", "\n", "---\n", "\n", "**YOU'RE ALL SET!**\n", "\n", "I hope you found this tutorial useful. If you encountered any problems, or would like to make a suggestion, feel free to open an issue on the `hoki` GitHub page [here](https://github.com/HeloiseS/hoki) or on the `hoki_tutorials` GitHub [there](https://github.com/HeloiseS/hoki_tutorials)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }