About
Brief History
On February 16, 2015, the North Carolina Data Science and Analytics Initiative (NCDSA) was funded through the UNC Research Opportunities Initiative (ROI) as a collaborative research effort among the data science initiatives at the University of North Carolina at Charlotte and North Carolina State University and the Renaissance Computing Institute (RENCI) in Chapel Hill. NCDSA-led research will broaden the economic impact of big data analytics within the state of North Carolina by advancing industry capability and education.
Infrastructure Overview
NCSU Data Science Resources (NCSU-DSR) is one of three components of a federated data science platform; the other two are hosted at RENCI/UNC-Chapel Hill and UNC-Charlotte.
The NCSU-DSI's resources were developed to provide NCSU faculty with centralized, shared data science resources, with the underlying goal of establishing North Carolina as a leader in data analysis and management. NCSU recognizes data science and data management as foundational for growth in scholarship and research, and it has responded in several ways to institutionalize data science. One of those ways is to provide centralized data science resources. Shared resources reduce the barriers to this goal by:
- Reducing administration costs for compliance
- Reducing administration costs for hardware and software maintenance
- Encouraging collaboration and exchange of ideas
Human resources
Patrick Dreher, PhD: Assistant Director, NCSU Data Science Initiative (DSI).
Role: NC State DSI steering committee member; coordination and oversight of NCSU DSI and DSI infrastructure efforts.
Andrew Petersen, PhD: Data Sciences Research Specialist at NCSU
Role: Liaison and point of contact for faculty using the resources, providing them with support; evangelist to prospective faculty users.
Aaron Peeler, BSc: Program Manager, IT & VCL, OIT at NCSU.
Role: Technical planning and occasional support.
Josh Thompson, MS: Systems Engineer, OIT, at NCSU.
Role: Technical planning and occasional support.
Physical Infrastructure Basics
Storage
Directly connected to iCat (iRods) server: 300 TB of GPFS disk space
Directly connected to High Performance Computing (HPC) nodes: 100 TB of GPFS disk space.
Servers
1 Lenovo Flex node x240 server: 2 sockets, 10 cores per socket (40 logical cores with hyperthreading), 128 GB RAM
Runs a VMware hypervisor hosting the following VMs, in addition to any compute-related virtual machines:
- iCat (iRods) server (VM): 2 cores, 2 GB memory, connected via NFS over 10G to 300 TB of disk space
- 4-6 VMs running the iCommands (iRods) client: 2 cores, 4 GB memory, 25 GB of disk space, CentOS 6
- Cloud Browser
1 Lenovo Flex node x240 server: 2 sockets, 10 cores per socket (40 logical cores with hyperthreading), 512 GB RAM:
- Large-memory (512 GB) compute node in the HPC system, for users who need a large shared-memory machine for analysis or for memory-intensive graphics, visualization, or animation
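The core counts quoted for the two servers can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the "10 core" figure is per socket and that hyperthreading provides two hardware threads per core, which is the reading that yields the 40 cores listed above. The `Server` class and the server labels are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical record for one physical server described above."""
    name: str
    sockets: int
    cores_per_socket: int
    ram_gb: int
    # Assumption: hyperthreading exposes 2 hardware threads per physical core.
    threads_per_core: int = 2

    @property
    def physical_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def logical_cores(self) -> int:
        return self.physical_cores * self.threads_per_core


# Figures taken from the server list above; the labels are illustrative.
servers = [
    Server("hypervisor host", sockets=2, cores_per_socket=10, ram_gb=128),
    Server("large-memory HPC node", sockets=2, cores_per_socket=10, ram_gb=512),
]

for s in servers:
    print(
        f"{s.name}: {s.physical_cores} physical cores, "
        f"{s.logical_cores} logical cores, {s.ram_gb} GB RAM"
    )
```

Under these assumptions each server has 20 physical cores, and the 40 cores quoted above are logical cores as seen by the operating system.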
Software and Tools
iRods Data Grid
Federation with NCSU, UNCCH, UNCC: Distributed and shared storage means that data can be placed close to the computing resources that will process or analyze it, while still allowing inter-institutional collaboration (which is the nature of the ROI projects, since they involve investigators from all three institutions).
Workflows with iRods micro-rules for automating data processing.
Advanced permission administration capabilities: Users can easily define which directories, collections or data sets are shared with which users.
Cloud Browser: Allows users to interface with the data grid via a web server running on the iCat (iRods) server. Users can upload and download data and examine metadata from anywhere with a web browser.
Command line or GUI: The iCommands (iRods) client, run from a VCL Linux image, allows users to interact with the data grid using GUIs or the command line.
Virtual Computing Lab
The Virtual Computing Lab (VCL) is NCSU's internal cloud, where virtual resources can be built and provisioned on demand. VCL images exist for the iCommands (iRods) client and many data science applications, and users can provision resources within 4 minutes. This eliminates the need for users to install and maintain software on their own machines.