I have been trying to remove duplicate rows in a csv file through both R and Excel. However, both keep returning that there are no duplicate rows even though I can clearly see them. I have checked for spaces, different number of characters, capitalization, and they all seem to be exact duplicates. I'm not sure what's going on, but I appreciate any advice on what to try next.
The CSV file has a header row, 23,700 rows, and 7 columns. The labels etc. are random but styled after my own dataset, just shorter. Example:
Sample     Class A  SubClass B  SubClass C  SubClass D  SubClass E  SubClass F
SJAKFHL    Type A   Cya         Oxy         Syne        Cyana       Syne_CC867
DSKLFHJAS  Type A   Pro         Gamma       S_13        Pseudo      C_III
SKJDHF     Type B   Pro         Oxy         Syne        C_47        Syne_CC867
ASDFJH     Type A   Cya         Oxy         Pseudo      Pseudo      Hacen
SJAKFHL    Type A   Cya         Oxy         Syne        Cyana       Syne_CC867
JLSDHFSDL  Type B   Act         Acid        Actin       Cyana       C_I
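As a reproducible sketch, the example rows can be built directly in base R. The column names use underscores here (an assumption, to keep them syntactic), and the object name `example` is hypothetical. On rows that are truly identical, `duplicated()` is expected to flag the repeat, which gives a baseline to compare the real file against.

```r
# Example rows from the question; underscored column names are an assumption.
example <- data.frame(
  Sample     = c("SJAKFHL", "DSKLFHJAS", "SKJDHF", "ASDFJH", "SJAKFHL", "JLSDHFSDL"),
  Class_A    = c("Type A", "Type A", "Type B", "Type A", "Type A", "Type B"),
  SubClass_B = c("Cya", "Pro", "Pro", "Cya", "Cya", "Act"),
  SubClass_C = c("Oxy", "Gamma", "Oxy", "Oxy", "Oxy", "Acid"),
  SubClass_D = c("Syne", "S_13", "Syne", "Pseudo", "Syne", "Actin"),
  SubClass_E = c("Cyana", "Pseudo", "C_47", "Pseudo", "Cyana", "Cyana"),
  SubClass_F = c("Syne_CC867", "C_III", "Syne_CC867", "Hacen", "Syne_CC867", "C_I"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE marks a row that repeats an earlier row exactly.
dup_flags <- duplicated(example)

# Row indices of the repeats.
dup_rows <- which(dup_flags)
print(dup_rows)
```

On this toy data the fifth row repeats the first, so it is the only one flagged. If the real file behaves differently, the rows there are probably not byte-for-byte identical.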
In Excel, I've tried using the =LEN(cell) function and downloading data merging tools, but so far haven't had any luck.
In R, I've tried a few different things, including working with the data as a data frame and as a tibble. I've been using vroom instead of readr because it is faster with large files. The data set I'm working with has been created by merging two others, hence the duplicates. Both of the data sets are formatted exactly the same as above, one with 12,000 rows and the other with 13,000. Some rows were excluded during the merge, which was expected, but other duplicates seem to have stayed. Neither left_join nor right_join will work for this.
library(vroom)
library(dplyr)
library(data.table)

data1 <- vroom("data1.csv")
data2 <- vroom("data2.csv")
merged_data <- full_join(data1, data2)

duplicated(merged_data)
# Returned all FALSE, claimed there were no duplicates

data_NoDuplicates1 <- duplicated(merged_data)
# Thought that maybe it needed to be assigned to a value, but still returned all FALSE, claimed there were no duplicates

data_NoDuplicates2 <- merged_data[!duplicated(merged_data), ]
# Had no effect

data_NoDuplicates3 <- distinct(merged_data)
# Had no effect
I'm not sure what to try next, or where the problem may be. I've tried experimenting with other ways of merging as well, but so far it's either given me the same result or just not worked.
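One possibility that visual inspection and length checks may not rule out is an invisible character, such as a non-breaking space, in one copy of a row. The sketch below is a hypothetical illustration on toy data, not a confirmed diagnosis of the real file. The names `toy`, `has_odd_chars`, and `clean` are made up for the example, and the non-breaking space is an assumed culprit.

```r
# Hypothetical toy data: both rows print the same, but the second Sample
# value ends with a non-breaking space (U+00A0). This is an assumption
# for illustration, not something observed in the real file.
toy <- data.frame(
  Sample = c("SJAKFHL", "SJAKFHL\u00a0"),
  Class  = c("Type A", "Type A"),
  stringsAsFactors = FALSE
)

# The strings differ, so neither row is reported as a duplicate.
print(duplicated(toy))

# Comparing raw bytes exposes differences that printing hides.
print(charToRaw(toy$Sample[1]))
print(charToRaw(toy$Sample[2]))

# Flag character columns that contain anything outside printable ASCII,
# or leading/trailing whitespace.
has_odd_chars <- sapply(toy, function(col) {
  is.character(col) && any(grepl("[^ -~]|^\\s|\\s$", col, perl = TRUE))
})
print(has_odd_chars)

# Normalise: turn non-breaking spaces into ordinary spaces, then trim.
clean <- toy
clean[] <- lapply(toy, function(col) {
  if (is.character(col)) trimws(gsub("\u00a0", " ", col, fixed = TRUE)) else col
})

# After cleaning, the second row is recognised as a repeat of the first.
print(duplicated(clean))
```

The same column-wise check could be run on merged_data before calling duplicated() or distinct(). If no column is flagged, the difference is likely elsewhere, for example in column types that differ between the two source files.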
Any advice is appreciated. Thank you.