Spampinato mice retina mea252ch pair recording - 1¶
Part 1) Cleaning the ground-truth data¶
This set of notebooks the dataset is from paired juxtacellular/extracellular recordings from mice retina in vitro. The MEA has 252 channels.
The official publication of this open dataset can be found at the following address: https://zenodo.org/record/1205233#.W9mq1HWLTIF
These datasets were used by Pierre Yger et al in the following "spyking circus" paper: https://elifesciences.org/articles/34518
After inspecting the juxta-cellular data, we found that some recordings don't have a good enough quality to be considered as "ground truth". To be "ground truth", a unit is required to be stable in the detection, peak signal-to-noise ratio (SNR) and amplitude.
At the end of our quality assessment, some files are removed for the main study shown in "spampinato-mice-retina-mea252ch-pair-recording-part2".
Quality assessment details¶
First, we have to run the script
- unzips the downloaded data
- runs a juxta cellular detection
- generates figure to manually check juxtacellular quality
- computes the peak SNR on the max channel of the MEA.
Before running the script, we need:
- to create a folder basedir
- to create a subfolder basedir/original_files that contain all zip downloded (20160415_patch2.tar.gz, ...)
Then we can run the script
After we can:
- inscpect in each folder explanatory figures.
Author: Samuel Garcia, CRNL, Lyon
Criterium to keep or remove a file¶
Having a very reliable ground truth is crucial, as all the following spike sorting performance metrics are designed on the hypothesis the ground truth is indeed ground truth.
In the following script we choose a high threshold value for peak detection: thresh = med + 8*mad, where:
- med is the median of the signal (the baseline),
- mad is the median absolut deviation (a robust std estimation),
- 8 is a quite high relative threshold that ensures the absence of false positive.
Two main criteria were used to keep a recording:
- the distribution of the peak values of the juxtacelullar action potentials must have a Gaussian distribution:
- a truncated Gaussian suggests that false negative (misses) corrupt the "ground truth",
- a multi-modal distribution suggests either that an amplitude drift occured or that two (or more) cells were present.
List of accepted recording (8)¶
'20160415_patch2', '20170803_patch1', '20160426_patch3', '20170725_patch1', '20170621_patch1', '20160426_patch2', '20170728_patch2', '20170713_patch1',
List of rejected recording (11)¶
'20170706_patch2' '20170629_patch2' '20170622_patch2' '20170726_patch1' '20170706_patch1' '20170706_patch3' '20170627_patch1' '20170630_patch1' '20170629_patch3' '20170623_patch1' '20170622_patch1'
(Some reader may think that we are too strict, but we prefer to be strict to ensure safe final results. Feel free to modify this list as you prefer using your own criteria.)
import matplotlib import os, shutil import zipfile, tarfile import re import numpy as np import pandas as pd import matplotlib.pyplot as plt # path basedir = '/media/samuel/dataspikesorting/DataSpikeSortingHD2/Pierre/zenodo/' recording_folder = basedir + 'original_files/' ground_truth_folder = basedir + 'ground_truth/' %matplotlib notebook
# this tridesclous utils are imported only for juxta detection to keep this script simple from tridesclous.peakdetector import detect_peaks_in_chunk from tridesclous.tools import median_mad from tridesclous.waveformtools import extract_chunks
rec_names = ['20170629_patch3', '20170728_patch2', '20170630_patch1', '20160426_patch2', '20170621_patch1', '20170627_patch1', '20170706_patch3', '20170706_patch1', '20170726_patch1', '20170725_patch1', '20160426_patch3', '20170622_patch1', '20170623_patch1', '20170622_patch2', '20170629_patch2', '20170713_patch1', '20160415_patch2', '20170706_patch2', '20170803_patch1']
# this unzip all files into recording_folder for rec_name in rec_names: filename = recording_folder + rec_name + '.tar.gz' if os.path.exists(recording_folder+rec_name) and os.path.isdir(recording_folder+rec_name): continue print('unzip', rec_name) t = tarfile.open(filename, mode='r|gz') t.extractall(recording_folder+rec_name)
if not os.path.exists(ground_truth_folder): os.mkdir(ground_truth_folder) gt_info = pd.DataFrame(index=rec_names) for rec_name in rec_names: print('detect_juxta: ', rec_name) # get juxta signal dirname = recording_folder + rec_name + '/' for f in os.listdir(dirname): if f.endswith('juxta.raw'): juxta_filename = dirname + f juxta_sig = np.memmap(juxta_filename, dtype='float32') # get mea signals for f in os.listdir(dirname): if f.endswith('.raw') and not f.endswith('juxta.raw'): mea_filename = dirname + f with open(mea_filename.replace('.raw', '.txt'), mode='r') as f: offset = int(re.findall('padding = (\d+)', f.read())) mea_sigs = np.memmap(mea_filename, dtype='uint16', offset=offset).reshape(-1, 256) print(1) # select only the 252 mea channel (see PRB file) mea_sigs = mea_sigs[:, list(range(126)) + list(range(128,254))] print(2) gt_folder = ground_truth_folder + rec_name + '/' os.mkdir(gt_folder) # detect spikes med, mad = median_mad(juxta_sig) print(3) thresh = med + 8*mad gt_indexes = detect_peaks_in_chunk(juxta_sig[:, None], k=10,thresh=thresh, peak_sign='-') gt_indexes = gt_indexes.astype('int64') gt_indexes.tofile(gt_folder+'juxta_peak_indexes.raw') print(4) # save some figures to for visual cheking sr = 20000. times = np.arange(juxta_sig.size) / sr fig, ax = plt.subplots() ax.plot(times, juxta_sig) ax.plot(times[gt_indexes], juxta_sig[gt_indexes], ls='None', color='r', marker='o') ax.set_xlim(0, 10) ax.axhline(-thresh, color='k', ls='--') ax.set_title('juxta detection - ' + rec_name) fig.savefig(gt_folder+'juxta detection.png') fig, ax = plt.subplots() count, bins = np.histogram(juxta_sig[gt_indexes], bins=np.arange(np.min(juxta_sig[gt_indexes]), 0, 0.5)) ax.plot(bins[:-1], count) ax.axvline(-thresh, color='k', ls='--') ax.set_title('juxta peak amplitude - ' + rec_name) fig.savefig(gt_folder+'juxta peak amplitude.png') print(5) # extract waveforms with only 150 peaks to minimize RAM n_left, n_right = -45, 60 some_gt_indexes = np.random.choice(gt_indexes, size=150) waveforms = extract_chunks(mea_sigs, some_gt_indexes+n_left, n_right-n_left) wf_median, wf_mad = median_mad(waveforms, axis=0) print(6) # get on wich channel the max is and the value max_on_channel = np.argmin(np.min(wf_median, axis=0), axis=0) # get the MAD (robust STD) on the mea signal # this estimate the SNR mea_median, mea_mad = median_mad(mea_sigs[:, max_on_channel] , axis=0) baseline = mea_median print(7) peak_value = np.min(wf_median[:, max_on_channel]) peak_value = peak_value- baseline peak_snr = np.abs(peak_value/mea_mad) # evrything in Dataframe gt_info.at[rec_name, 'nb_spike'] = gt_indexes.size gt_info.at[rec_name, 'max_on_channel'] = max_on_channel gt_info.at[rec_name, 'peak_value'] = peak_value gt_info.at[rec_name, 'peak_snr'] = peak_snr gt_info.at[rec_name, 'noise_mad'] = mea_mad fig, ax = plt.subplots() ax.plot(wf_median.T.flatten()) fig.savefig(gt_folder+'GT waveforms flatten.png') fig, ax = plt.subplots() ax.plot(wf_median) ax.axvline(-n_left) fig.savefig(gt_folder+'GT waveforms.png') print(8) gt_info.to_excel(ground_truth_folder+'gt_info.xlsx')
# 2 simple functions def get_juxta_filename(rec_name): # find the juxta file dirname = recording_folder + rec_name + '/' for f in os.listdir(dirname): if f.endswith('juxta.raw'): juxta_filename = dirname + f return juxta_filename def plot_juxta_amplitude(rec_name): juxta_filename = get_juxta_filename(rec_name) juxta_sig = np.memmap(juxta_filename, dtype='float32') med = np.median(juxta_sig) mad = np.median(np.abs(juxta_sig-med))*1.4826 thresh = med + 8*mad gt_indexes = ground_truth_folder + 'juxta_peak_indexes.raw' gt_indexes = np.fromfile(ground_truth_folder + rec_name + '/juxta_peak_indexes.raw', dtype='int64') gt_amplitudes = juxta_sig[gt_indexes] fig, axs = plt.subplots(nrows=2) count, bins = np.histogram(gt_amplitudes, bins=np.arange(np.min(juxta_sig[gt_indexes]), 0, 0.5)) ax = axs ax.plot(bins[:-1], count) ax.axvline(-thresh, color='r', ls='--') ax.axvline(med, color='k', ls='-') for i in range(1,6): ax.axvspan(med - i * mad, med + i * mad, color='k', alpha=0.05) fig.suptitle('juxta peak amplitude - ' + rec_name) ax = axs ax.plot(gt_indexes, gt_amplitudes, ls='None', marker='o') ax.axhline(-thresh, color='r', ls='--') for i in range(1,6): ax.axhspan(med - i * mad, med + i * mad, color='k', alpha=0.05)
Why some recordings are are not kept?¶
In the following figures:
- the black vertical line is the baseline (median) of juxta-cellular trace,
- the grey areas represent 1, 2, 3, 4, 5 MAD (robust STD),
- the red line is the detection threshold.
Figure for 20170629_patch2¶
Here the peak amplitude distribution crosses the detection threhold. Missed events are obvious in the middle part of recording.
Figure for 20170622_patch2¶
Here the peak amplitude distribution crosses the detection threhold and too few events got detected.
Figure for 20170726_patch1¶
Here again the peak amplitude distribution crosses the detection threhold. Some spikes are clearly missed at the begnning.
Figure for 20170630_patch1¶
Suspicion of missing spikes at the beggining and at the end of recording.
Figure for 20170623_patch1 : NO¶
Here the amplitude distribution right tail is too close to the detection threhold and there is a suspicion of missed spikes.
Figure for 20170713_patch1 : OK, but boundary¶
We see here two clear peaks in the distribution suggesting that there could be an electrode movement.
Figure for 20170725_patch1 : OK but bundary¶
A ground truth we can trust, but there is a suspicious change in amplitude in the middle of recording.
- 11 out of 19 files have been removed for further ground-truth analysis.
- 8 out of 19 files are kept for ground-truth analysis.
For paired recording ground truth, the ground truth itself has to be carefully verified.
The original ground spike index provided on 19 files by the authors are not trustable for a fair spike sorting comparison.