
A few methods of storing datasets are outlined below. The choice of method depends on your preference and the size of the dataset. Keep in mind that, regardless of the size of your dataset, each account on DataHub is provided with ~1 GB of RAM, which limits the amount of data that you can read in at any one time. If you want to temporarily increase this RAM limit, please raise a GitHub issue.
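Because of the RAM limit, it can help to process a file row by row instead of loading it all at once. The following is a minimal sketch using only the Python standard library; the function name `column_mean` and the column it reads are hypothetical and stand in for whatever computation your notebook needs.

```python
import csv


def column_mean(csv_path, column):
    """Return the mean of one numeric column, reading the file one row at a time."""
    total = 0.0
    count = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = row[column]
            if value == "":
                continue  # skip empty cells instead of failing on float("")
            total += float(value)
            count += 1
    if count == 0:
        raise ValueError(f"No numeric values found in column {column!r}")
    return total / count
```

Only one row is held in memory at a time, so memory use stays roughly constant as the file grows.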

Small Datasets (a few MBs)

GitHub

Datasets and the corresponding Jupyter Notebook can be stored in a folder on GitHub. You can then create an nbgitpuller link for the entire folder. When students click this link, the entire folder will appear in their DataHub account.
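As an illustration, an nbgitpuller link can be assembled by hand. This sketch assumes the commonly used nbgitpuller link layout (`/hub/user-redirect/git-pull` with `repo`, `branch`, and `urlpath` query parameters); the hub URL and repository below are hypothetical placeholders, so check the result against the nbgitpuller link generator before sharing it.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="main", urlpath=None):
    """Build an nbgitpuller-style link (assumed layout, see the note above)."""
    params = {"repo": repo_url, "branch": branch}
    if urlpath is not None:
        params["urlpath"] = urlpath  # where the student lands after the pull
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(params)}"


# Hypothetical hub and repository, for illustration only.
link = nbgitpuller_link(
    "https://datahub.example.edu",
    "https://github.com/example-org/example-course",
    urlpath="tree/example-course/",
)
print(link)
```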

Outside Hosts

You can store the data on an online host such as Box, Google Drive, or even GitHub.

Direct Upload

Students can directly upload data files to their DataHub account. This method can get messy if notebooks expect the data to be stored at a certain filepath and students upload the files to a different location. Therefore, we recommend using the other methods listed on this page.

Larger Datasets (tens of MBs to several GBs)

Our current recommendation is to keep the file size of each dataset below 100 MB. We recommend the following approaches to instructors and students who plan to use large datasets in their courses.

Shared directory

In scenarios where you have large datasets or commonly used libraries, a shared directory can serve as a centralized location for these resources. This prevents the need for duplicating files across multiple user spaces, saving disk space and bandwidth.

Shared Read Only Directory: The shared directory is a read-only folder that provides students enrolled in your course with access to course-related datasets and resources. Students will be able to view and read the contents of this folder but will not have permission to make any modifications or perform write operations. These shared directories will be mounted to each user’s file path at /home/jovyan/_shared and will be named based on the course they are associated with. For example, the shared directory for the course Econ 148 will be named econ148-readonly.

Shared ReadWrite Directory: The shared directory with read-write access enables instructors to upload, modify, and manage files within the folder. Instructors can upload datasets here, and they will be automatically synced to the corresponding “shared-readonly” directory, giving students read-only access to the added content. The folder will be mounted at /home/jovyan/_shared and will be named based on the course it belongs to. For example: econ148-readwrite.
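In a notebook, the shared directories described above can be addressed by building the path from the course name. This is a minimal sketch that assumes the mount point and naming pattern given above; the helper names `shared_dir` and `shared_file` and the file name in the usage comment are hypothetical.

```python
from pathlib import Path

# Mount point described above; adjust if your hub differs.
SHARED_ROOT = Path("/home/jovyan/_shared")


def shared_dir(course, access="readonly", root=SHARED_ROOT):
    """Return the shared directory path for a course, e.g. econ148-readonly."""
    if access not in ("readonly", "readwrite"):
        raise ValueError("access must be 'readonly' or 'readwrite'")
    return Path(root) / f"{course}-{access}"


def shared_file(course, filename, root=SHARED_ROOT):
    """Return the path to a file in the read-only share, failing early if it is missing."""
    path = shared_dir(course, "readonly", root) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset at {path}")
    return path


# Usage in a notebook (hypothetical file name):
# data_path = shared_file("econ148", "example.csv")
```

Failing early with the full expected path makes it easier to tell a missing file apart from a notebook bug.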

Create a GitHub issue if you want shared directories enabled for your course. You need to provide the bCourses ID for your course and the DataHub URL so that the shared directories appear on the hub you use, with appropriate permissions for everyone on your course roster in bCourses.

SyncThing

SyncThing is an application that allows users to share their files and folders with collaborators through Dropbox-like functionality. You can store all your data in the SyncThing folder and share it with your collaborators, who can then read the data into their Jupyter notebooks. Refer to this documentation, which explains how to share files via SyncThing.

Outside Hosts

You can store the data on an online host such as Box, Google Drive, or even GitHub. The datascience package contains a [read_table()](http://data8.org/datascience/_autosummary/datascience.tables.Table.read_table.html#datascience.tables.Table.read_table) function for the [Tables](http://data8.org/datascience/tables.html) data structure. This function will load the data from a given URL.
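If the datascience package is not available, the same idea can be sketched with the Python standard library alone. The helper `read_csv_from_url` below is hypothetical and is not part of datascience; it assumes the URL points directly at a CSV file (not a sharing or preview page) and that the file is small enough to fit in memory.

```python
import csv
import io
from urllib.request import urlopen


def read_csv_from_url(url, encoding="utf-8"):
    """Download a CSV file and return its rows as a list of dicts keyed by column name."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    # newline="" lets the csv module handle line endings itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))


# With datascience installed, the text above describes the equivalent call:
# Table.read_table(url)
```

Every value comes back as a string, so numeric columns need to be converted before doing arithmetic on them.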