# Python Programming for Scientists - Day 2

On the second day we will cover:

* formatting print output
* reading/writing text files
* reading/writing binary (HDF5) files
* modules
* installing new modules with pip
* the standard library

# [1] Input and Output

### Formatting (print) output

There are several ways to present the output of a program: data can be printed in a human-readable form, or written to a file for future use. So far we have seen the `print()` function:

In [1]:
x = 5
print('starting')
print(x)
print('x = ', x, ' which is great!')

starting
5
x =  5  which is great!


We can print strings and variables, which is enough for some simple use cases, but there are three ways to have more control with print:
1. "Old" formatting with `%`
2. The `.format()` method
3. "f-strings"

The first, "old formatting", is easy to use:

In [2]:
y = 2.2
print('x = %d' % x)
print('x = %d, y = %f' % (x,y))

x = 5
x = 5, y = 2.200000


The expressions `%d` and `%f` are examples of [format codes](https://docs.python.org/3/library/string.html#formatspec): placeholders that the `%` operator replaces with the value of a variable, formatted as requested. The substitution is done on the string itself, before `print()` receives it.

The variables are listed after the string, in the form `string % (var1,var2,var3)`; for a single variable the parentheses can be omitted.

The most common and useful formatting codes are:
* `%s` for a string
* `%d` for an integer number
* `%4d` for an integer number, padding it with spaces to be at least 4 characters long.
* `%04d` for an integer number, padding it with zeros to be at least 4 characters long.
* `%e` for a floating-point number in scientific notation.
* `%f` for a floating-point number, showing six digits after the decimal point by default.
* `%.2f` for a floating-point number, showing two digits after the decimal point.
* `%8f` for a floating-point number, padded with spaces to be at least eight characters wide.
* `%6.2f` for a floating-point number, at least six characters wide in total, with two digits after the decimal point. The decimal point itself counts as one character, which leaves three characters before it.
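
To make the codes above concrete, the short sketch below applies several of them with the `%` operator. The values (`'abc'`, 42, 1234.5, 2.2 and 3.14159) are arbitrary illustrations, and `repr()` is used only so that the padding is visible.

```python
# Each format code is applied with the % operator; repr() shows the padding explicitly.
examples = [
    ('%s',    'abc'),     # string
    ('%4d',   42),        # padded with spaces to width 4
    ('%04d',  42),        # padded with zeros to width 4
    ('%e',    1234.5),    # scientific notation
    ('%f',    2.2),       # six digits after the decimal point by default
    ('%.2f',  3.14159),   # two digits after the decimal point
    ('%6.2f', 3.14159),   # total width 6, two digits after the decimal point
]

results = [code % value for code, value in examples]
for (code, value), text in zip(examples, results):
    print(f'{code!r:>8} % {value!r:<8} -> {text!r}')
```

Building the strings first and printing them afterwards shows that the formatting is independent of `print()`: the same strings could equally be written to a file.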

### Exercise

Assign the number 123 to a variable and print it padded to five characters with zeros. Then, print it using the `%s` format code - what happens, and why?

Assign the number 4.2311e2 to a variable, and print it with three different representations.

In [4]:
# your solution here
z=123
print('z = %05d' % z)
print('z = %s' %z)

z = 00123
z = 123


In [7]:
a=4.2311e2
print('a=%3d' %a)
print('a=%f' %a)
print('a=%.2f' %a)

a=423
a=423.110000
a=423.11


The second method uses the same format codes, but in a slightly different syntax with `.format()`:

In [8]:
print('We say {} and think that x = {}'.format('thanks',x))

We say thanks and think that x = 5


In [9]:
print('We say {word} and think that x = {myvar}'.format(word='hi',myvar=x))

We say hi and think that x = 5


In [10]:
print('The second entry is {1} while the first is {0}'.format(x,y))

The second entry is 2.2 while the first is 5


In [11]:
ages = {'tom':22, 'sam':23, 'philip':18}
print('Tom is {tom:d} years old, while Philip is {philip:d}'.format(**ages))

Tom is 22 years old, while Philip is 18


The last method uses [f-strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) (i.e. "formatted string literals"), available since Python 3.6. This is the modern, recommended way to produce formatted output:

In [12]:
word = 'never!'
print(f'We say {word} and think that x = {x}')

We say never! and think that x = 5


In [13]:
print(f'We chose {word = } and know that {x = }')

We chose word = 'never!' and know that x = 5


In [14]:
print(f'{x = :4d} and {y = :.2f}, and also y = {y:.3f}')

x =    5 and y = 2.20, and also y = 2.200


Of the three methods, you can use whichever seems the most natural and easiest to read.

### Exercise

Write a function that takes six integer inputs: (year, month, day, hour, minutes, and seconds) and converts them to a string with format `2006-03-22 13:12:55`.

In [21]:
# your solution here
def format_date(year, month, day, hour, minutes, seconds):
    # zero-pad every field so that e.g. month 3 is written as '03'
    return f'{year:04d}-{month:02d}-{day:02d} {hour:02d}:{minutes:02d}:{seconds:02d}'

print(format_date(2006, 3, 22, 13, 12, 55))

2006-03-22 13:12:55


### Exercise

Write a function that takes a string formatted as `2006-03-22 13:12:55` and returns a 6-tuple of integers giving the corresponding year, month, day, hour, minutes, and seconds.

In [22]:
# your solution here


## Reading (text) files

So far we have only been working with variables and data which we have explicitly written. In data processing and data analysis, most often you will need to load data from "external" files (on the filesystem).

The simplest case, which works well for small datasets, is a plain text file. When the file contains values separated by commas, we call this a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values) (comma-separated values); the term is often used loosely for files that use spaces or tabs as the separator as well.

There are many ways to read such files in Python. First, we can use the built-in `open()` function.

In [23]:
f = open('data/day2_stars.txt','r')

The first argument is a string containing the filename. The second argument is another string, the *mode*, containing a few characters that describe how the file will be used. The mode can be `'r'` when the file will only be read, `'w'` for only writing (an existing file with the same name will be erased), or `'a'` for appending (any data written to the file is automatically added to the end). `'r+'` opens the file for both reading and writing.

Normally, files are opened in "text mode", which means that you read and write strings from and to the file. You can append `'b'` to the mode to open a file in binary mode: now the data is read and written in the form of bytes objects. This mode should be used for all files that don’t contain text.
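
The modes described above can be tried out safely on a throwaway file. This is a minimal sketch: the temporary directory and the file name `modes_demo.txt` are arbitrary choices made for the illustration.

```python
import os
import tempfile

# A throwaway file in a temporary directory (the name 'modes_demo.txt' is arbitrary).
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:       # 'w': create the file, or erase an existing one
    f.write('first line\n')

with open(path, 'a') as f:       # 'a': keep the existing content, add to the end
    f.write('second line\n')

with open(path, 'r') as f:       # 'r': read as text (str)
    text = f.read()

with open(path, 'rb') as f:      # 'rb': read the same file as raw bytes
    raw = f.read()

print(repr(text))
print(type(text), type(raw))
```

Opening the file a second time with `'w'` instead of `'a'` would have discarded the first line.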

In [24]:
# f is an 'open' file object, we should use it to read, and then close
lines = f.readlines()
f.close()

In [25]:
len(lines)

10

In [26]:
lines

['RA         DEC        NAME (ID)         Jmag   e_Jmag\n',
 '(deg)      (deg)                        (mag)  (mag) \n',
 '---------- ---------- ----------------- ------ ------\n',
 '010.684737 +41.269035 00424433+4116085   9.453  0.052\n',
 '010.683469 +41.268585 00424403+4116069   9.321  0.022\n',
 '010.685657 +41.269550 00424455+4116103  10.773  0.069\n',
 '010.686026 +41.269226 00424464+4116092   9.299  0.063\n',
 '010.683465 +41.269676 00424403+4116108  11.507  0.056\n',
 '010.686015 +41.269630 00424464+4116106   9.399  0.045\n',
 '010.685270 +41.267124 00424446+4116016  12.070  0.035\n']

In Python it is good practice to use the **with keyword** when dealing with file objects. The advantage is that the file is guaranteed to be properly closed, even if an error occurs at some point while reading. (You should always make sure to close a file object: otherwise data that has been written may not reach the disk, and the open file keeps using system resources.)

In [27]:
with open('data/day2_stars.txt','r') as f:
    lines = f.readlines()

The syntax works as follows: the `open()` function is run, and its return value is assigned to the variable `f`. The indented code block which follows can then use this `f` file object. As soon as the indented code is finished, the file is automatically closed.

Note that `readlines()` reads the entire file and returns a list, where each item is a single line (including its trailing newline character).

### Exercise

Two other functions exist, `readline()` and `read()`. Try each, loading the same file as above, and compare the result.

In [30]:
# your solution here
with open('data/day2_stars.txt','r') as f:
    line = f.readline()   # a single line: the first one
    rest = f.read()       # everything that remains, as one long string

Another very pythonic way to read the lines of a file is to loop over them:

In [31]:
with open('data/day2_stars.txt','r') as f:
    for line in f:
        print(line)

RA         DEC        NAME (ID)         Jmag   e_Jmag

(deg)      (deg)                        (mag)  (mag) 

---------- ---------- ----------------- ------ ------

010.684737 +41.269035 00424433+4116085   9.453  0.052

010.683469 +41.268585 00424403+4116069   9.321  0.022

010.685657 +41.269550 00424455+4116103  10.773  0.069

010.686026 +41.269226 00424464+4116092   9.299  0.063

010.683465 +41.269676 00424403+4116108  11.507  0.056

010.686015 +41.269630 00424464+4116106   9.399  0.045

010.685270 +41.267124 00424446+4116016  12.070  0.035
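
The blank lines in the output above appear because every line read from a file keeps its trailing newline `\n`, and `print()` then adds another one. Below is a minimal sketch of two common remedies. An in-memory `io.StringIO` object is used as a stand-in for the data file (an assumption made so that the example runs anywhere), and the two sample lines are copied from the listing above.

```python
import io

# io.StringIO behaves like an open text file; it stands in for data/day2_stars.txt here.
fake_file = io.StringIO(
    '010.684737 +41.269035 00424433+4116085   9.453  0.052\n'
    '010.683469 +41.268585 00424403+4116069   9.321  0.022\n'
)

cleaned = []
for line in fake_file:
    print(line, end='')                  # option 1: tell print() not to add its own newline
    cleaned.append(line.rstrip('\n'))    # option 2: strip the newline from the string itself

print(cleaned)
```

Stripping the newline is usually the better habit when the lines are going to be processed further rather than just displayed.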



## List (and dict) comprehension

A useful shorthand in Python, which can often save you from writing a loop to construct a list or a dict, is called [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). The syntax looks like:

    [item for item in iterable_object] makes a list
    {key:val for key in iterable_object} makes a dict
    
For example:

In [32]:
squares = [i**2 for i in range(10)]
print(squares)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [33]:
my_dict = {i:'ok' for i in range(5)}
print(my_dict)

{0: 'ok', 1: 'ok', 2: 'ok', 3: 'ok', 4: 'ok'}


In [34]:
my_dict = {i:i**3 for i in range(5)}
print(my_dict)

{0: 0, 1: 1, 2: 8, 3: 27, 4: 64}


Don't go crazy with list comprehensions - you will often see Python experts writing this in such complex ways they are impossible to understand. Used in moderation, however, they can be quite helpful.
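
Two moderate extensions are worth knowing: a comprehension can filter items with a trailing `if`, and a dict comprehension can take its keys and values from two sequences paired up with `zip()`. The names and ages in this sketch are illustrative values only.

```python
# A comprehension may end with an `if` clause that filters the items.
even_squares = [i**2 for i in range(10) if i % 2 == 0]
print(even_squares)

# zip() pairs up two sequences, which is handy for building a dict.
names = ['tom', 'sam', 'philip']
ages_list = [22, 23, 18]
ages_by_name = {name: age for name, age in zip(names, ages_list)}
print(ages_by_name)
```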

Importantly, "iterable_object" is an idea you will often see in Python. It means anything which can return one element at a time, i.e. which can be "iterated" over. Lists and generators are two examples, as are file-like objects:

In [35]:
with open('data/day2_stars.txt','r') as f:
    lines = [line for line in f]
    
print(len(lines))

10


Regardless, we have the data "loaded", but so far it is just a list of strings, one string per line.

### Exercise

Construct a list which contains only (and all of) the `Jmag` values. Then, compute the sum. Hint: loop over the lines (strings) we have loaded, use `.split()` to split each (using whitespace). 

In [36]:
# your solution here


### Exercise

Construct a dictionary, which stores the RA ([right ascension](https://en.wikipedia.org/wiki/Right_ascension), i.e. position on the sky) as the values, using the NAME as the key.

In [37]:
# your solution here


## Writing (text) files

Instead of reading a text file, we can write to a text file, simply by changing the mode from `'r'` to `'w'` and using the `.write(string)` method:

In [38]:
with open('output.txt','w') as f:
    f.write("just a little test\n")
    f.write("and some more")

As above, we can also use `writelines()` to write a list of strings all at once.

In [39]:
with open('output.txt','w') as f:
    f.writelines([str(x) for x in squares])

### Exercise

Open up the "output.txt" file we have just made - use either a terminal, or the file explorer on the left. What's wrong? Fix it.

In [41]:
# your solution here: writelines() does not add line breaks, so append '\n' to each string
with open('output.txt','w') as f:
    f.writelines([str(x) + '\n' for x in squares])

### Exercise

Look at the text file `data/day2_gallazzi.txt`, identify the rows which are header (metadata), the number of columns, and so on.

Load the file, and write a text file which contains only the stellar mass $M_\star$ (first column) and metallicity $Z$ (second column) from the original file.

In [42]:
# your solution here


## Reading and writing (large) binary data

For large numeric datasets, you will rarely want to read or write these as text (strings): the resulting files take more space to store, and precision may be lost if not enough digits are written.

The alternative is to write "binary" data (meaning it is just a series of bytes representing numbers, rather than representing strings).

In Python you can read binary data by changing the mode from `'r'` to `'rb'`:

In [43]:
with open('data/day2_numbers.bin','rb') as f:
    data_bytes = f.read()

In [44]:
print(data_bytes)

b'\x00\x01\x04\t\x10\x19$1@Q'


The contents of this file are just a series of bytes, which are impossible to interpret without some prior knowledge. If we know that each byte encodes the value of a single integer, we can actually use the data:

In [None]:
for byte in data_bytes:
    print(int(byte))

In [None]:
data_numbers = [int(byte) for byte in data_bytes]
print(data_numbers)

Such files are called "raw binary", and they are difficult to work with. Ahead of time, you have no idea what the contents of the file are, nor how to read it correctly, unless it comes along with an exact description of how data has been arranged in the file.
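
To illustrate why the description matters, the sketch below decodes the same ten bytes in two different ways with the standard-library `struct` module. Both results are equally "valid" as far as the file is concerned; the choice of 16-bit little-endian integers for the second reading is an arbitrary assumption made for the comparison.

```python
import struct

data_bytes = b'\x00\x01\x04\t\x10\x19$1@Q'   # the ten bytes printed above

# Interpretation 1: ten unsigned 8-bit integers ('B'), one per byte.
as_uint8 = struct.unpack('10B', data_bytes)

# Interpretation 2: five unsigned 16-bit integers ('H'), little-endian ('<').
as_uint16 = struct.unpack('<5H', data_bytes)

print(as_uint8)
print(as_uint16)
```

Only the first interpretation recovers the squares that were stored, and nothing in the file itself says which one is intended.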

In science and scientific computing, there are a number of binary data formats in common use - this depends on the particular field. Some common formats are:
* [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) in general, especially for simulations.
* [FITS](https://en.wikipedia.org/wiki/FITS) in astrophysics.
* [CDF](https://cdf.gsfc.nasa.gov/) in space sciences, earth sciences.
* many others...

These are all "self-describing" binary formats, meaning that they adhere to a well-known standard, so that you can understand the structure of a given file without any prior knowledge.

In a single file they can store multiple datasets e.g. 1D arrays, 2D arrays, 3D arrays, etc, together with metadata (number of elements in an array, its physical units, and so on).

## HDF5


Let's look at a quick example with **HDF5**. First, we import the [h5py Python library](https://docs.h5py.org/en/stable/index.html) used to read and write HDF5 files:

In [1]:
import h5py

First, let's **write** (create) a new HDF5 file:

In [2]:
import numpy as np

# squares was defined earlier; it is rebuilt here in case the kernel has been restarted
squares = [i**2 for i in range(10)]

with h5py.File('test.hdf5','w') as f:
    f['dataset1'] = squares
    f['more_data'] = [float(sq) for sq in squares]
    f['a_third_dataset'] = np.array([33,22,11])

We have just created a new binary file named `test.hdf5` which contains three one-dimensional datasets (i.e. arrays).

Now let's **read** an existing HDF5 file:

### Exercise

Open a terminal, and type `h5ls test.hdf5` to list the contents of this file. Try `h5ls -r test.hdf5` and `h5ls -rv test.hdf5` as well, for "recursive" and "verbose".
How many datasets are there? How many entries in each? What is the data type of each?

In the notebook, use `h5py` to read the first number from the `more_data` dataset in our newly created file.

In [None]:
# your solution here


What if we don't know what datasets are in the file, or what their names are? We can use `.keys()`, just as with a dictionary.

In [None]:
with h5py.File('test.hdf5','r') as f:
    dset_names = list(f.keys())

print(dset_names)

In general, we can read an entire dataset using the following syntax:

In [None]:
with h5py.File('test.hdf5','r') as f:
    entire_dataset = f['more_data'][()]

But one of the powerful aspects of HDF5 and similar binary data formats is that you can load specific subsets of data.

Imagine that you had an enormous 100GB data file, which is too large to load into the memory all at once. To load only the second through fifth entries:

In [None]:
with h5py.File('test.hdf5','r') as f:
    data_subset = f['more_data'][1:5]
    
print(data_subset)

Notice how we are using the same indexing and slicing syntax that we have seen already, to tell h5py what subset of the dataset to load.

> HDF5 is a rich format with many other features - take a look at the [h5py quickstart guide](https://docs.h5py.org/en/stable/quick.html#quick).

## pickle

Often you just want to save some variable in python to a file, so that you can load it again later (or send it to a colleague).

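The `pickle` module makes this possible: it can convert almost any Python object (e.g. variable) into a sequence of bytes which can be saved and loaded again later (only load pickle files from sources you trust, since loading one can execute arbitrary code):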

In [None]:
import pickle

In [None]:
x = {'a':'Some data','b':'Additional data'}
y = [x, 33]

# save
with open('test.pickle','wb') as f:
    pickle.dump(y,f)

In [None]:
# load
with open('test.pickle','rb') as f:
    y_loaded = pickle.load(f)
print(y_loaded)
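
Not quite every object can be pickled, however. The sketch below first does a round trip in memory with `pickle.dumps()` and `pickle.loads()`, then tries to pickle a generator, which is one example of an object that `pickle` rejects. The variable names are chosen for illustration only.

```python
import pickle

# dumps()/loads() work like dump()/load(), but use a bytes object instead of a file.
original = {'a': 'Some data', 'b': [1, 2, 3]}
restored = pickle.loads(pickle.dumps(original))
print(restored == original, restored is original)

# Some objects cannot be pickled, for example a generator.
gen = (i**2 for i in range(3))
try:
    pickle.dumps(gen)
    gen_pickled = True
except TypeError as err:
    gen_pickled = False
    print('could not pickle:', err)
```

The restored dictionary is equal to the original but is a separate copy, so changing one does not affect the other.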

## Accessing remote (online) data using APIs

It is becoming increasingly popular, and powerful, to access data from remote, online resources (that is, websites).

Services providing this functionality are called APIs ("application programming interfaces"). For example:

* You could use the [Twitter API](https://developer.twitter.com/en/docs) to search and retrieve all tweets containing a certain word posted on a certain day.
* You could use a [RKI COVID API](https://api.corona-zahlen.org/docs/) to retrieve details on coronavirus infections in Germany.
* You could retrieve historical data on the price of Bitcoin from the [Coinbase API](https://docs.cloud.coinbase.com/).

APIs can not only provide data, but also services. For example:

* You could use a [machine-learning image recognition API](https://docs.imagga.com/#introduction) to classify what an image (that you send) contains.

Modern web APIs are almost always based on [REST](https://en.wikipedia.org/wiki/Representational_state_transfer), which simply means that:

1. You send and receive information with a standard "web request" to a particular URL (the same as going to the URL in your browser).
2. Each request is "stateless", meaning that it is completely independent from any prior or future requests.

They are very easy to use. Just to get a feel, let's look at the [randomuser.me API](https://randomuser.me). This is just a "toy" API for understanding the concepts. It generates random profiles of people.

In [None]:
import requests

The [requests library](https://docs.python-requests.org/en/latest/) makes it very easy to make "web" (HTTP) requests in Python.

In [None]:
# specify the specific API URL to access
url = "http://randomuser.me/api/"

# make request
response = requests.get(url)

We can check that the [response code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) indicates success (200).

In [None]:
response.status_code

Then we can look at the actual text of the response:

In [None]:
response.text

You can see that this looks a bit like a dictionary, with a key called `results`. In fact, this is a string which is encoded in [JSON](https://en.wikipedia.org/wiki/JSON) format.

**JSON** is a very common "human-readable" text format for exchanging data between programs, particularly across the web.

In this case `requests` can automatically decode this response text:

In [None]:
response.json()

In [None]:
type(response.json())

The result is, in fact, a normal Python dictionary.
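
The same decoding can be done by hand with the standard-library `json` module, which is useful when JSON text comes from a file rather than from `requests`. This is a minimal sketch on a hand-written string: the field names and values are made up for the illustration and are not an actual API response.

```python
import json

# A hand-written JSON string; the field names are illustrative only.
text = '{"results": [{"name": "Ada", "phone": "555-0100"}], "info": {"page": 1}}'

decoded = json.loads(text)                 # JSON text -> Python dict, list, str, int, ...
print(type(decoded), decoded['results'][0]['name'])

encoded = json.dumps(decoded, indent=2)    # Python objects -> JSON text
print(encoded)
```

JSON objects become dictionaries and JSON arrays become lists, so the usual indexing syntax applies to the decoded result.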

In [None]:
r = response.json()
type(r['results'])

In [None]:
len(r['results'])

In [None]:
r['results'][0].keys()

### Exercise

Make five requests to this API. For each, print out the name and phone number of the (automatically generated) person.

In [None]:
# your solution here


# [2] Modules

If you exit the Python interpreter and start it again (or select "Restart Kernel" in a Jupyter notebook), the definitions you have made (functions and variables) are lost.

Therefore, if you want to write a longer and more complex program, the standard approach is to use a text editor to prepare the series of commands, and run that instead. This is known as creating a script.

As your program gets longer, you will want to split it into several files for easier maintenance. You may also want to use a common function, written once, in several different programs, without copying its definition into each program.

To achieve this, we create a Python **module**. Definitions from a module can be imported into other programs, scripts, or notebooks.

We have already seen examples of module imports:

In [None]:
import math

# we can then use the math.cos() function of the math module
math.cos(0)

If you want to import a specific function or definition from a module:

In [None]:
from math import cos

# we can then use the cos() function without any prefix
cos(0)

Although we can import every definition from a module into the current program, this is considered poor form and should **never** be done:

In [None]:
from math import * # puts cos, sin, and dozens of other names into the current program, and could silently overwrite existing ones
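
To see what such a star import would bring in without affecting the current program, the sketch below runs it inside a separate, empty namespace. Using `exec()` with a dictionary is done purely for this demonstration and is not something you would normally write.

```python
import math

# Run the star import inside a separate, empty namespace so that the
# current program is not affected; exec() is used only for this demonstration.
namespace = {}
exec('from math import *', namespace)

imported = sorted(name for name in namespace if not name.startswith('__'))
print(len(imported), 'names imported, e.g.', imported[:5])

# The built-in pow() would be replaced by math.pow, which always returns a float.
print(pow(2, 3), namespace['pow'](2, 3))
```

The built-in `pow` is one concrete example of a name that a star import of `math` would replace without any warning.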

We can also rename a module, or a specific function from a module, when we import it:

In [None]:
from math import cos as my_cos # perhaps to avoid a conflict

In [None]:
import numpy as np # just for convenience

## Making a custom module

A "module", in its simplest form, is just a file with a `.py` extension.

### Exercise

Create a new file in the current directory (use the terminal, or right-click on the file explorer and select "New File"), name it `fibonacci.py`. Paste the following contents into it:

    # Fibonacci numbers module

    def fib(n):    # write Fibonacci series up to n
        a, b = 0, 1
        while a < n:
            print(a, end=' ')
            a, b = b, a+b
        print()

Then, we should be able to import it, and run it:

In [None]:
import fibonacci

In [None]:
fibonacci.fib(10)

A module can contain executable statements, as well as function definitions. Any executable statements are intended to initialize the module (generally, don't do this!), and are only executed the first time the module is imported.

Modules can import other modules. It is customary but not required to place all import statements at the beginning of a module (or script, for that matter).
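
The "only executed the first time" behaviour can be checked directly. The sketch below writes a tiny module into a temporary directory and imports it twice. The module name `init_demo` and its contents are invented for the illustration, and the temporary directory is added to `sys.path` so that Python can find it.

```python
import pathlib
import sys
import tempfile

# Write a tiny module to a temporary directory (file and module names are arbitrary).
module_dir = tempfile.mkdtemp()
pathlib.Path(module_dir, 'init_demo.py').write_text(
    "print('init_demo is being initialized')\n"
    "VALUE = 42\n"
)

sys.path.insert(0, module_dir)   # let Python find modules in that directory

import init_demo                 # first import: the print statement runs
import init_demo                 # second import: nothing is printed
print(init_demo.VALUE)
```

This is also why edits to a module file seem to have no effect in a running notebook until the kernel is restarted.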

## Running a custom module from the command-line

You may want to run a piece of python code in the terminal like `python fibonacci.py <arguments>`. This may be useful for testing, for submitting the code to a compute cluster, and so on.

To do this, we need to add the following code to the bottom of our `fibonacci.py` file:

    if __name__ == "__main__":
        import sys
        fib(int(sys.argv[1]))
        
What is going on? The `if` statement is not within a function, so it is immediately executed when the module is imported, or when the file is run with the command `python fibonacci.py`.

When the module is imported, the esoteric sounding condition `__name__ == "__main__"` is False, and nothing happens. But when run from the command line, this is True, in which case the `fib()` function is run, and the first command-line argument is passed as the argument for this function.

### Exercise

Add this code to the bottom of our module file, and use the command-line (File -> New -> Terminal) to run it.

> Note on how python finds modules: When you type `import mymodule`, python first searches for a built-in module with that name. If not found, it then searches for a file named `mymodule.py` in a list of directories given by the variable sys.path, which contains these locations:
> * The directory containing the input script (or the current directory when no file is specified).
> * The `PYTHONPATH` environment variable (a list of directory names, with the same syntax as the shell variable PATH).
> * The default search paths for your python installation.
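
The search path can be inspected from within Python. As a small sketch, the code below prints the first few entries of `sys.path` and uses `importlib.util.find_spec()` to ask where a module would be loaded from. The name `surely_not_a_real_module_name` is a deliberately nonexistent module used for contrast.

```python
import importlib.util
import sys

# sys.path is an ordinary list of directory names, searched in order.
for entry in sys.path[:5]:
    print(repr(entry))

# find_spec() reports where a module would be loaded from, without importing it.
spec = importlib.util.find_spec('json')
print(spec.origin)

# For a module that cannot be found anywhere on the path, the result is None.
missing = importlib.util.find_spec('surely_not_a_real_module_name')
print(missing)
```

Checking `find_spec()` in this way is a quick test of whether a package you have just installed is visible to the Python you are running.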

## Installing modules

You will often want to install a module (or library) that you've found online, with some functionality that you want to use.

There are a few different ways to do it, depending on the situation (these are terminal commands, not in a notebook):

1. From source code, i.e. found in a github repository:

> git clone https://www.github.com/username/module_name
>
> cd module_name
>
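> python setup.py install --user
>
> (Recent versions of setuptools deprecate calling `setup.py` directly; running `pip install --user .` in the same directory is the current equivalent.)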

2. From the [PyPI](https://pypi.org/) package index (most common):

> **pip install --user module_name**

3. If you are using an [Anaconda](https://www.anaconda.com/products/individual) installation of python:

> conda install module_name

Note: the `--user` option asks for the package to be installed locally, in your home directory, rather than "system-wide", which is usually not permitted on a shared computer or cluster. (You could install packages system-wide on your personal computer, but this is bad practice.)

## If you need to install pip:

Most typical python installations, on clusters or your own system, will have the `pip` command available. This is a package manager, i.e. it installs, upgrades, uninstalls, and keeps track of libraries on the [pypi index](https://pypi.org/).

On the KIP JupyterLab server, `pip` is not available by default. We can install it:

1. Open a terminal (File -> New -> Terminal).
2. Type `wget https://bootstrap.pypa.io/get-pip.py` (download the install script).
3. Type `python get-pip.py`.

You will notice a message about the pip executable not being on the current PATH.

1. Type `echo 'export PATH=/cipuser/zah/wu533/.local/bin:$PATH' >> .bash_profile` (**change the path /zah/wu533/** to match your own username). This adds the directory to the PATH environment variable, which tells the OS where to search for executables.
2. Type `source .bash_profile` to apply the change immediately; in the future this happens automatically when you log in.

After this, we could do:

    pip install --user memory_profiler
    
(we will use this package on day 5).

## Conda environments, installing packages with `conda`

It is fairly common these days to use **Anaconda** to install a working python environment. Anaconda combines at least two features: independent "environments", and a nice package manager (like pip).

The first step, done only once, is to create an empty environment (optionally, with a specific version):

```bash
    conda create --prefix=~/.local/envs/myenv python=3.9
```

Then you "activate" the environment (usually, by adding this line to your `.bashrc` file, so that it is always active when you start):

```bash
    source activate ~/.local/envs/myenv
```

Then, any packages you install, and e.g. their specific versions, are contained within this environment:

```bash
    conda install package_name
```

This is nice, for example, if you are working on more than one project, but different projects require different libraries, or conflicting versions of libraries -- you can keep two separate environments.

# [3] The "standard library"

Every python installation comes with a number of [standard library](https://docs.python.org/3/library/index.html) modules, many of which are essential to use. Let's explore some of the most important:

## Files, directories, and filesystem interaction

The `os` module provides dozens of functions for interacting with the operating system:

In [None]:
import os

In [None]:
os.getcwd()

In [None]:
os.path.isfile('test.hdf5')

In [None]:
os.rename('test.hdf5','test_new.hdf5')

The `glob` module can search for files or directories, also using wildcards:

In [None]:
import glob

In [None]:
files = glob.glob('*.ipynb')
for file in files:
    print(file)
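
The two modules are often used together. The following sketch creates a small directory tree in a temporary location (so that nothing in the course folder is touched) and then searches it with a wildcard. All file and directory names are arbitrary.

```python
import glob
import os
import tempfile

# Work inside a temporary directory so nothing in the course folder is touched.
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, 'results'), exist_ok=True)

# Create three empty files with arbitrary names.
for name in ['run1.txt', 'run2.txt', 'notes.md']:
    with open(os.path.join(workdir, 'results', name), 'w'):
        pass

# os.path.join builds paths portably; the wildcard matches only the .txt files.
pattern = os.path.join(workdir, 'results', '*.txt')
txt_files = sorted(os.path.basename(p) for p in glob.glob(pattern))
print(txt_files)
```

`glob.glob()` returns its matches in no guaranteed order, which is why the result is sorted before use.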

## Dates and times

The `datetime` module supplies classes for manipulating dates and times, including handling timezones.

In [None]:
import datetime

In [None]:
now = datetime.date.today()
print(now)

In [None]:
now.strftime("%m-%d-%y. %d %b %Y is a %A on the %d day of %B.")
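
Dates also support arithmetic. This is a brief sketch with two fixed, arbitrarily chosen dates (so the output is reproducible), showing that subtracting dates yields a `timedelta` and that a `timedelta` can be added back to a date.

```python
import datetime

start = datetime.date(2006, 1, 1)
event = datetime.date(2006, 3, 22)

# Subtracting two dates gives a timedelta object.
elapsed = event - start
print(elapsed.days)

# Adding a timedelta to a date gives a new date.
one_week_later = event + datetime.timedelta(days=7)
print(one_week_later.isoformat())

# weekday() counts from Monday = 0.
print(event.weekday())
```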

The `time` module is useful for recording how long it takes for a piece of code to run (i.e. for performance benchmarking).

In [None]:
import time

In [None]:
start_time = time.time()

for i in range(100000):
    x = [j**2 for j in range(10)]
    
print('Execution took %.2f seconds.' % (time.time()-start_time))
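
For timing specifically, `time.perf_counter()` is generally a better choice than `time.time()`: it is a high-resolution clock intended for measuring intervals and is unaffected by changes to the system clock. Here is a sketch of the same benchmark wrapped in a function; the name `build_squares` is invented for the example.

```python
import time

def build_squares(repeats):
    """Build the same small list of squares `repeats` times; return the last one."""
    result = []
    for _ in range(repeats):
        result = [j**2 for j in range(10)]
    return result

start = time.perf_counter()
squares_list = build_squares(100000)
elapsed = time.perf_counter() - start

print(f'Execution took {elapsed:.3f} seconds.')
```

The measured time varies from run to run, so repeat a benchmark several times before drawing conclusions from it.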

## Mathematics

> Note: the standard math/random/stats libraries are rarely used in practice, since numpy (tomorrow!) is better.

The `math` module gives access to the underlying C library functions for floating point math:

In [None]:
import math

In [None]:
math.cos(math.pi / 4)

In [None]:
math.log10(100)

The `random` module provides tools for making random selections:

In [None]:
import random

In [None]:
random.choice(['apple', 'pear', 'banana'])

In [None]:
random.random() # float in the range [0.0, 1.0): 0.0 is possible, 1.0 is not

The `statistics` module calculates basic statistical properties (the mean, median, variance, etc.) of numeric data:

In [None]:
import statistics

In [None]:
data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
statistics.median(data)
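
A few more calls on the same data, together with one detail of `random` that matters for reproducible science: seeding. The seed value 12345 below is arbitrary.

```python
import random
import statistics

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]

data_mean = statistics.mean(data)           # arithmetic mean
data_median = statistics.median(data)       # middle value of the sorted data
data_variance = statistics.variance(data)   # sample variance (divides by n - 1)
print(data_mean, data_median, data_variance)

# Seeding the generator makes a "random" sequence reproducible.
random.seed(12345)
first = [random.random() for _ in range(3)]
random.seed(12345)
second = [random.random() for _ in range(3)]
print(first == second)
```

Setting a seed at the top of an analysis script lets you or a colleague regenerate exactly the same random numbers later.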

### Exercise

Does the `math.cos` function take radians or degrees? Are there functions that can convert between radians and degrees? Use these to find the cosine of 60 degrees, and the sine of pi/6 radians.

> Hint: in a Jupyter notebook, or in an IPython console, type `math.` and then hit the `TAB` key. This will show you a list of definitions (i.e. functions) within the math module.
>
> Similarly, you can also type `?math` to print out some documentation for a module, or `?math.cos` to print out documentation for a specific function.
>
> Beyond this, you should find and consult the online documentation of whatever module or library you are using.

In [None]:
# your solution here


## Regular expressions

[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (or "regex") are very powerful, and complicated, tools for searching and manipulating text. If you do extensive processing of complex text, they will be essential.

The `re` module provides regular expression tools for advanced string processing:

In [None]:
import re

In [None]:
regex = r'\bf[a-z]*' # we want to match: any sequence of characters which "is a word" (starts with a word boundary, e.g. space, "\b"), then the letter "f", then any number of letters between a-z
string_to_search = 'which foot or hand fell fastest'

re.findall(regex, string_to_search)
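
Beyond `findall()`, two other features are used very often: capturing part of a match with parentheses, and replacing matches with `re.sub()`. The filename and the row of numbers in this sketch are made-up examples.

```python
import re

# A made-up filename; parentheses in the pattern define a "group" to extract.
filename = 'snapshot_099.hdf5'
match = re.search(r'snapshot_(\d+)\.hdf5', filename)

snapshot_number = None
if match:
    snapshot_number = int(match.group(1))   # text matched by the first group
print(snapshot_number)

# re.sub() replaces every match; here each run of whitespace becomes a single comma.
row = '9.453   0.052     11.507'
csv_row = re.sub(r'\s+', ',', row)
print(csv_row)
```

`re.search()` returns `None` when nothing matches, which is why the result is checked before `group()` is called.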

### Exercise

Write your email address as a string variable. Then, separate the username and domain name into two separate strings (without re). Do it again using re.

In [None]:
# your solution here


## Other standard libraries

Many others exist (see [documentation](https://docs.python.org/3/library/index.html)), you will generally discover them when googling with a specific problem or question. They fall into the categories:

* Text processing
* Binary data (including **struct** for reading/writing raw structured binary data)
* Data types (including **OrderedDict**)
* Numeric and math
* Functional programming (other ways of looping, making functions)
* File, directory, and OS access (including **subprocess** to run commands)
* Data persistence (saving/loading/databases)
* Compression (zip)
* File loading (CSV, **configparser**)
* Encryption and hashing
* Concurrent execution (parallel programming, including **threading** and **multiprocessing** - Friday!)
* Networking and internet communication (low-level)
* GUI tools
* Documentation and (unit) testing tools
* Debugging, profiling, and packaging tools

<hr style="border:2px solid #bbb; margin: 30px 0"> </hr>

# Day 2 Practice Problem - Average Temperatures

Use the [data/day2_munich_temps.txt](data/day2_munich_temps.txt) data file, which gives the temperature in Munich every day (one line each), for several years. 

## Task A

Read in the data, and print out the minimum, average, maximum temperature for each year, e.g:

    1995: -3C  10C  35C
    1996: ...
    
Hint: you could use a dictionary to help store values.

In [None]:
# your solution here


## Task B

For the year 2000, print out the same three statistics, averaging over each of the twelve months, e.g:

    January: -15C 0C 3C
    February: ...

In [None]:
# your solution here


## Task C

For each month, print out the same three statistics, but averaging over all of the years available.

In [None]:
# your solution here


<hr style="border:2px solid #bbb; margin: 30px 0"> </hr>

# Day 2 Challenge Problem - IllustrisTNG API

The IllustrisTNG project is a suite of "cosmological" galaxy formation simulations. Each simulation in IllustrisTNG evolves a large swath of a mock Universe from soon after the Big-Bang until the present day while taking into account a wide range of physical processes that drive galaxy formation. The simulations can be used to study a broad range of topics surrounding how the Universe — and the galaxies within it — evolved over time. 

Step 1. Sign up for a [public data access](https://www.tng-project.org/data/) account.

Step 2. Follow the [Data Access API tutorial](https://www.tng-project.org/data/docs/api/) to get familiar with accessing such scientific data using a web-based API.

## Task A

Use the API to search the TNG100-1 simulation at snapshot 99 (redshift zero) for all galaxies with total mass between $10^{12.0} M_\odot$ and $10^{12.2} M_\odot$ (see Task \#2 on the API webpage for hints).

For each, compute and print the gas fraction, defined as the ratio of gas mass to dark matter mass.

In [None]:
# your solution here


## Task B

Download a "particle cutout" of one of the galaxies you have found from the previous task. Open the resulting HDF5 file with `h5py` and examine its contents. Print the number of particles of each type.

In [None]:
# your solution here


<hr style="border:2px solid #bbb; margin: 30px 0"> </hr>

# Day 2 Challenge Problem - Bitcoin Price API

Use the public Coinbase API to download the current exchange rate (i.e. price) for Bitcoin, in different currencies.

The URL:

    https://api.coinbase.com/v2/exchange-rates?currency=BTC
    
provides a JSON response.

## Task A

Create a loop to query this API twenty or thirty times. Between each query, pause the program for three seconds (use `time.sleep(3.0)` from the `time` module).

For each query, parse the JSON response to obtain the current BTC price in EUR, and save it into a list.

In [None]:
# your solution here


## Task B

We want to test a strategy for trading BTC:
* At the time corresponding to the first data point you have from above, assume you start with "wallet" of 1000 euros.
* Walk forward in time, through the price data series. At each step:
  * if the current price is greater than a factor `(1+frac)` times the previous price, then we take this as a sign the value is increasing, and we "buy". Use your entire wallet balance, if it is in EUR, to buy BTC, i.e. use the current price to convert from EUR to BTC.
  * if the current price is less than a factor `(1-frac)` times the previous price, take this as a sign the value is decreasing, and "sell". Use your entire wallet balance, if it is in BTC, to sell, i.e. use the current price to convert from BTC to EUR.
  
Choose a reasonable value for `frac`.
  
At the end of the time period (when you are out of data), convert your wallet balance into EUR using the final price, if it isn't already. Have you made or lost money?

In [None]:
# your solution here
